Multi-stage gene normalization for full-text articles with context-based species filtering for dynamic dictionary entry selection
© Tsai and Lai; licensee BioMed Central Ltd. 2011
Published: 3 October 2011
Gene normalization (GN) is the task of identifying the unique database IDs of genes and proteins in literature. The best-known public competition of GN systems is the GN task of the BioCreative challenge, which has been held four times since 2003. The last two BioCreatives, II.5 & III, had two significant differences from earlier tasks: firstly, they provided full-length articles in addition to abstracts; and secondly, they included multiple species without providing species ID information. Full papers introduce more complex targets for GN processing, while the inclusion of multiple species vastly increases the potential size of dictionaries needed for GN. BioCreative III GN uses Threshold Average Precision at a median of k errors per query (TAP-k), a new measure closely related to the well-known average precision, but also reflecting the reliability of the score provided by each GN system.
To use full-paper text, we employed a multi-stage GN algorithm and a ranking method which exploit information in different sections and parts of a paper. To handle the inclusion of multiple unknown species, we developed two context-based dynamic strategies to select dictionary entries related to the species that appear in the paper—section-wide and article-wide context. Our originally submitted BioCreative III system uses a static dictionary containing only the most common species entries. It already exceeds the BioCreative III average team performance by at least 24% in every evaluation. However, using our proposed dynamic dictionary strategies, we were able to further improve TAP-5, TAP-10, and TAP-20 by 16.47%, 13.57% and 6.01%, respectively in the Gold 50 test set. Our best dynamic strategy outperforms the best BioCreative III systems in TAP-10 on the Silver 50 test set and in TAP-5 on the Silver 507 set.
Our experimental results demonstrate the superiority of our proposed dynamic dictionary selection strategies over our original static strategy and most BioCreative III participant systems. Section-wide dynamic strategy is preferred because it achieves very similar TAP-k scores to article-wide dynamic strategy but it is more efficient.
Gene normalization (GN) is the task of identifying the unique database IDs of genes and proteins found in literature. Even for trained biologists, GN is difficult, because several problems complicate the association of a mention with the correct ID. For one, gene and protein names often have several spelling variations or abbreviations. In other instances, gene products are described indirectly in a phrase, rather than being referred to by a specific name or code.
In many regards, the GN tasks of BioCreative II.5 & III are similar to those of previous BioCreative [1, 2] workshops. However, they have two significant differences: firstly, they provide full-length articles in addition to abstracts; and secondly, instead of being human species-specific, they include multiple species and provide no species ID information. Both changes bring the BioCreative GN task closer to real-world curation of a model organism database.
The first difference, full-text articles, introduces more complex targets for GN processing. Unlike abstracts, full text articles contain many parts and sections, including the main freetext sections (introduction, methods, etc.), metadata, figure/table captions, notes, and so on. Each section or part has its own characteristics which we can use to guide GN and the ranking algorithm. For example, the Introduction section often contains information that repeatedly appears throughout the article (key genes), while the Results section presents new scientific findings, such as PPIs. Extracting a PPI from the Results section may require resolving an acronym whose full name has only been mentioned in the Introduction section. To exploit this type of section-specific information, we have developed a multi-stage memory-based GN procedure and a ranking method.
Predictably, the second difference, inclusion of multiple species, increases inter-species ambiguity. One gene name, abbreviation or code may refer to genes in multiple species, each with its own unique ID, or even to multiple genes in the same species or across different species. For example, without context, a search for ‘tumor protein p53, TP53’ in Entrez Gene may return results for proteins with the same name in over 20 species. Since the species in the context is unknown, all entries in the gene name dictionary must be loaded for GN. Currently, EntrezGene is the largest and most widely used publicly available gene or gene product database and has the best coverage of names and species. However, if the billions of names that it contains are all loaded for GN, the GN process slows down greatly.
Our GN system is designed to deal with the two changes above. To utilize the characteristics of different sections of a full-length paper, we use a three-stage GN procedure (see Methods section for details). In summary, the procedure is carried out starting from the sections with the richest context information (introduction) to those with the poorest. For our purposes, the informationally richest sections are those that are most likely to mention a gene’s full name. Therefore, the introduction section is usually the richest section because it is here that authors first mention the genes of interest, giving their full names often followed by abbreviations used thereafter. The informationally poorest sections tend to be figure/table captions, which lack context information. Identifiers normalized in richer parts are used to help GN in poorer parts.
To handle the inclusion of multiple unknown species, we reduce ambiguity by dynamically selecting relevant entries from the dictionary for each paper or section and by employing an ID ranking model that sorts all genes in the paper according to confidence of correct normalization. By including species context features in the ranking model, we can improve inter-species accuracy. Many similar approaches have been proposed and proven effective [4, 5]. BioCreative III gene normalization task data is used to evaluate our proposed strategies.
After preprocessing, the multi-stage GN procedure is executed (Figure 1: stage 1 to 3). This method refines single-sentence-based GN by using section-specific information, scanning the whole article from the informationally richest to poorest sections—i.e. from the introduction section to table/figure captions.
The final step is ranking all normalized identifiers in a paper. We formulated the ranking problem as a support vector machine (SVM) classification problem, incorporating the confidence of the normalized identifiers and context information as features.
In the following sections, we explain the above steps in detail and illustrate our strategies for selecting gene name dictionary entries for GN.
Three main subtasks are involved in our sentence-based GN method: gene mention recognition (GMR), dictionary matching, and disambiguation processing.
The recognition of gene names is handled by a machine-learning (ML)-based gene mention tagger trained on the BioCreative II gene mention dataset. The GMR problem is formulated as a word-by-word sequence labeling task, where the assigned tags delimit the boundaries of any gene names. The underlying ML model is the conditional random fields model with a set of features selected by a sequential forward search algorithm.
After GMR, we employ several post-processing rules developed in our previous work to identify more gene mentions. For instance, if a parenthesized phrase follows an identified gene mention, we also regard the contents of the parentheses as a gene mention. The keywords, abbreviations, and full names recorded in the metadata are also used to adjust the gene mention boundary if the gene name string is a substring of one of them, or vice versa. Take the sentence “Interaction between fortilin and transforming growth factor-beta stimulated clone-22 (TSC-22) prevents apoptosis via the destabilization of TSC-22” as an example. The metadata stores the information that “transforming growth factor-beta stimulated clone-22” is the full name of “TSC-22”. Our GM tagger recognizes “transforming growth factor-beta” as a gene, which is a substring of the full name stored in the metadata. As a result, the boundary is extended to include “stimulated clone-22”. The original string before adjustment is also stored in the metadata, which is checked when the adjusted gene name cannot be successfully mapped (in this example, the original string “transforming growth factor-beta” is also stored).
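The two post-processing rules described above can be sketched in a few lines of Python. This is a minimal illustration rather than the system's actual implementation: the function names and the representation of the metadata as a plain set of full names are assumptions, and only one direction of the substring check (mention contained in a stored full name) is shown.

```python
import re


def extend_with_metadata(sentence, mention, full_names):
    """Extend a tagged mention to a metadata full name that contains it.

    `full_names` is a hypothetical stand-in for the full names stored in
    the article metadata. The longest full name that contains the mention
    and occurs in the sentence wins; otherwise the mention is unchanged.
    """
    candidates = [fn for fn in full_names if mention in fn and fn in sentence]
    return max(candidates, key=len) if candidates else mention


def parenthesized_follow_up(sentence, mention):
    """Return the contents of a parenthesized phrase that directly follows
    the mention, or None if there is no such phrase."""
    match = re.search(re.escape(mention) + r"\s*\(([^()]+)\)", sentence)
    return match.group(1) if match else None


sentence = ("Interaction between fortilin and transforming growth factor-beta "
            "stimulated clone-22 (TSC-22) prevents apoptosis via the "
            "destabilization of TSC-22")
full_names = {"transforming growth factor-beta stimulated clone-22"}

extended = extend_with_metadata(sentence, "transforming growth factor-beta", full_names)
abbreviation = parenthesized_follow_up(sentence, extended)
```

In this sketch the tagger output "transforming growth factor-beta" is first widened to the stored full name, and the parenthesized abbreviation that follows the widened mention is then picked up as an additional gene mention.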
The recognized gene names are finally examined against a blacklist to filter out false positives. The list is automatically compiled from two databases, MeSH (for diseases) and HyperCLDB (for cell lines), and the website NEB (for restriction enzymes). Our blacklist contains about 65,000 terms. When processing each article, our system dynamically updates the blacklist with synonyms (full names or abbreviations) according to the full-name/abbreviation mapping in the article metadata.
Dictionary matching assigns candidate identifiers to each recognized gene mention. Two matching strategies are employed. The first, exact matching, uses a dictionary compiled by collecting gene names in EntrezGene and generating their orthographical variants. Each recognized gene mention is looked up in the dictionary. If an exact match is found, then the gene is assigned that entry’s ID. The second, partial matching, relies on the Lucene search engine: because all of these terms are indexed by Lucene, we can use the engine to find partial matches for each recognized gene mention.
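The exact-then-partial lookup can be illustrated with a small, self-contained sketch. The system described here uses Lucene for partial matching; the token-overlap index below is only a simplified stand-in for that engine, and the function names, entry format, and gene IDs ("G1", "G2") are hypothetical.

```python
from collections import defaultdict


def build_index(entries):
    """Build an exact-name index and a token index from (gene_id, name) pairs."""
    exact = defaultdict(set)
    by_token = defaultdict(set)
    for gene_id, name in entries:
        key = name.lower()
        exact[key].add(gene_id)
        for token in key.split():
            by_token[token].add(gene_id)
    return exact, by_token


def match(mention, exact, by_token):
    """Try an exact match first; fall back to a token-overlap partial match."""
    key = mention.lower()
    if key in exact:
        return "exact", sorted(exact[key])
    hits = set()
    for token in key.split():
        hits |= by_token.get(token, set())
    if hits:
        return "partial", sorted(hits)
    return "none", []


# Illustrative entries with placeholder identifiers.
entries = [("G1", "TP53"), ("G1", "tumor protein p53"), ("G2", "Trp53")]
exact, by_token = build_index(entries)
```

The returned match type ("exact" or "partial") corresponds to the two Boolean matching features later used for ranking.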
If a gene mention is assigned two or more gene identifiers, we must determine which is more appropriate through disambiguation processing.
The goal of disambiguation is to select the most likely gene identifier from multiple gene identifiers which share the same gene name. We manually constructed several rule-based classifiers which use context information, such as chromosome location, sequence length and so on, to determine the given identifier’s label. Each classification rule follows this general form:
r: (Condition) → y × w
S(id) refers to the species keywords of id
C(id) refers to the cell line keywords of id
PPI(id) refers to the interaction partner of id
FN(id) refers to the gene mention’s full name (its identifier is id)
T(id) refers to the tissue keywords of id
D(id) refers to the domain keywords of id
F(id) refers to the family keywords of id
M(id) refers to the MASS of id
GO(id) refers to the GO terms of id
Chromosome Location: CL(id) refers to the chromosome locations of id
Sequence Length: SL(id) refers to the sequence lengths of id
RS Number: R(id) refers to the RS number of id
The id refers to an identifier from the ambiguous list.
The nid refers to a successfully normalized identifier stored in the metadata.
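A rule of the form r: (Condition) → y × w can be read as "if the condition holds, cast a vote y with weight w". The sketch below shows one way such weighted votes could be combined to pick an identifier. It is an illustration under stated assumptions: the two conditions, the weights, and the dictionary layout of candidates and context are all hypothetical and are not taken from Table 1.

```python
def species_rule(candidate, context):
    """S(id): a species keyword of the candidate appears in the context."""
    return bool(candidate["species_keywords"] & context["words"])


def ppi_rule(candidate, context):
    """PPI(id): the candidate interacts with an already normalized nid."""
    return bool(candidate["ppi_partners"] & context["normalized_ids"])


# Each rule is (condition, y, w); the weights here are made up for illustration.
RULES = [(species_rule, 1, 2.0), (ppi_rule, 1, 1.0)]


def weighted_vote(candidate, context, rules=RULES):
    """Sum y * w over all rules whose condition holds for the candidate."""
    return sum(y * w for condition, y, w in rules if condition(candidate, context))


def disambiguate(candidates, context, rules=RULES):
    """Return the id of the candidate with the highest weighted vote."""
    return max(candidates, key=lambda c: weighted_vote(c, context, rules))["id"]


context = {"words": {"mouse", "cells"}, "normalized_ids": {"N1"}}
candidates = [
    {"id": "A", "species_keywords": {"human"}, "ppi_partners": {"N1"}},
    {"id": "B", "species_keywords": {"mouse"}, "ppi_partners": set()},
]
chosen = disambiguate(candidates, context)
```

Note that the PPI rule can only fire once some nid has been stored, which is why the processing order of sections matters for this kind of classifier.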
Our three-stage GN procedure is shown in Figure 1.
In the first stage, GN is executed in the following order: Introduction, Abstract, Title. Successfully normalized identifiers are kept in memory (the metadata) for use in subsequent sections. We process the Introduction section first because the Abstract and Title sections are more concise and contain less contextual information and fewer identifiers. Following the order above, certain classifiers, including the PPI, Full-name/Acronym and the History classifier, are more effective. Take the PPI classifier for example. The classifier uses a gene's PPI information to disambiguate identifiers. As shown in Table 1, it requires a normalized identifier, nid, stored in the metadata. For each ambiguous gene identifier id, the classifier checks whether id – nid is a PPI pair recorded in HPRD or not. If we process the article in a linear order (Title → Abstract → Introduction), the value of the PPI classifier will always be 0 when processing the Title (the same applies to the Full-name/Acronym and History classifiers). The values of other classifiers also tend to be 0 because of the lack of context information.
In this stage, the successfully normalized gene mentions and corresponding identifiers are extracted from the metadata to generate a dictionary. We then search the whole article for mentions in this dictionary. The Title, Abstract, and Introduction sections are also rechecked in case GMR missed any instances. When tagging gene mentions outside the Title, Abstract, and Introduction sections, the dictionary-based tagger also checks species keywords in the same sentence. If keywords are found and matched with the corresponding ID’s species, the ID is assigned. Otherwise, the tagger checks the metadata to see which species is the focus of the paper and assigns this to the mention. The focus species is determined by calculating the frequencies of the species keywords. The most frequent species is chosen as the focus species and is stored in the metadata.
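The focus-species heuristic and the sentence-level fallback can be sketched as follows. The keyword lists are hypothetical and far smaller than a real species lexicon; only the NCBI Taxonomy IDs (9606 for human, 10090 for mouse) are real. When a sentence names several species, this sketch simply takes the first one found, which is an assumption rather than documented behaviour.

```python
import re
from collections import Counter

# Hypothetical keyword lists keyed by NCBI Taxonomy ID.
SPECIES_KEYWORDS = {
    "9606": ["human", "homo sapiens"],
    "10090": ["mouse", "mice", "mus musculus"],
}


def count_species(text, keywords=SPECIES_KEYWORDS):
    """Count whole-word occurrences of each species' keywords in the text."""
    lowered = text.lower()
    counts = Counter()
    for tax_id, words in keywords.items():
        for word in words:
            counts[tax_id] += len(re.findall(r"\b" + re.escape(word) + r"\b", lowered))
    return counts


def focus_species(article_text, keywords=SPECIES_KEYWORDS):
    """Return the most frequently mentioned species, or None if none is mentioned."""
    counts = count_species(article_text, keywords)
    tax_id, hits = max(counts.items(), key=lambda kv: kv[1], default=(None, 0))
    return tax_id if hits > 0 else None


def species_for_mention(sentence, focus, keywords=SPECIES_KEYWORDS):
    """Prefer a species named in the same sentence; otherwise use the focus species."""
    found = [tax_id for tax_id, n in count_species(sentence, keywords).items() if n > 0]
    return found[0] if found else focus


article = ("Mice lacking Trp53 were compared with human samples, "
           "and mouse embryos were analysed.")
focus = focus_species(article)
```

The species returned by `species_for_mention` would then be used to choose among the IDs that the dictionary-based tagger finds for a mention.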
Compared to directly employing a full list of gene names as a dictionary to annotate the whole article, this procedure can reduce the number of false positives. It can also improve gene normalization accuracy in sections outside the Introduction section because an abbreviation’s full name can usually be found in the Introduction section.
In this stage, each normalized identifier from stage three is ranked by an SVM classifier. For each identifier, the corresponding information stored in memory is used to extract features. In the following section, we describe the extracted features for gene identifier ranking.
As mentioned before, there are two matching strategies to generate identifiers in our system: exact and partial matching. They are represented as Boolean features.
The value of the weighted vote generated by our disambiguation process is used as a feature. In addition, 13 Boolean features, which indicate whether or not the corresponding GN Classifier listed in Table 1 votes for the identifier, are also used as features.
The frequency with which the ID appears in the entire article is used as a feature. In addition, based on the work of McIntosh and Curran, who found that molecular interaction descriptions usually appear in the Results section, we added the percentage of an ID found in the Results section as a feature.
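The two frequency features can be computed from a flat list of normalized occurrences. In this sketch, "the percentage of an ID found in the Results section" is interpreted as the share of that ID's occurrences that fall in the Results section; this interpretation, the function name, and the input format are assumptions.

```python
def frequency_features(occurrences, gene_id):
    """Compute frequency features for one gene ID.

    `occurrences` is a list of (gene_id, section_name) pairs, one per
    normalized mention in the article.
    """
    sections = [section for gid, section in occurrences if gid == gene_id]
    total = len(sections)
    in_results = sum(1 for section in sections if section.lower() == "results")
    return {
        "frequency": total,
        "results_fraction": in_results / total if total else 0.0,
    }


occurrences = [
    ("A", "Introduction"),
    ("A", "Results"),
    ("B", "Results"),
    ("A", "Results"),
    ("A", "Methods"),
]
features_a = frequency_features(occurrences, "A")
```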
Location in full-text article
Among the last n_1 sentences in the abstract
The first section (usually the Introduction section)
Among the last n_2 sentences in the first section
The Results section
The other sections
The last section (usually the Conclusion section)
Section, sub-section or paragraph titles
Known information feature sets.
A Boolean feature which indicates whether or not the identifier’s gene name matches keywords.
Full name/abbreviation match
A Boolean feature which indicates whether or not the identifier’s gene name matches full names or abbreviations.
Most ambiguity in the GN process comes from the large number of existing gene names in dictionaries and the even larger number that results from the expansion of those original names. Inclusion of multiple species greatly compounds this complexity. Limiting gene dictionary size or excluding certain species’ genes may lessen the ambiguity and improve efficiency, but it may also lose crucial data. We propose two types of strategies for selecting relevant gene dictionary entries, static and dynamic.
Using a static strategy, the same set of terms is used in performing GN for every article. The sample static strategy that we designed for this paper uses only gene names from the 22 most common species in NCBI (from 7283 species).
In the dynamic strategy, we use varying sets of names chosen according to the species context. The context can range from a sentence or paragraph to a whole section or even article, but in our system we only implement the latter two. We use two methods to detect the species in the context. The first is a keyword-based approach, which employs regular expressions to check for UniProt species keywords in the given section or article. If we identify keywords for certain species, we check only entries belonging to those species when performing GN.
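The keyword-based dynamic strategy can be sketched as a filter over dictionary entries. The regular expressions, entry format, and function names below are hypothetical; the real system checks UniProt species keywords. The same function covers both contexts: pass one section's text for section-wide selection, or the concatenated article for article-wide selection. What happens when no species is detected is not specified in the text, so this sketch simply returns an empty list.

```python
import re


def detect_species(text, species_patterns):
    """Return the Taxonomy IDs whose keyword pattern occurs in the text."""
    return {
        tax_id
        for tax_id, pattern in species_patterns.items()
        if re.search(pattern, text, flags=re.IGNORECASE)
    }


def select_entries(dictionary, text, species_patterns):
    """Keep only dictionary entries whose species is detected in the context."""
    found = detect_species(text, species_patterns)
    return [entry for entry in dictionary if entry["tax_id"] in found]


# Hypothetical patterns keyed by Taxonomy ID (9606 = human, 4565 = Triticum aestivum).
species_patterns = {
    "9606": r"\b(human|homo sapiens)\b",
    "4565": r"\b(wheat|triticum aestivum)\b",
}
dictionary = [
    {"name": "TP53", "tax_id": "9606"},
    {"name": "EXPB11", "tax_id": "4565"},
]
sections = {
    "Introduction": "Human cells were used as a reference.",
    "Results": "Expression was measured in wheat (Triticum aestivum) leaves.",
}

section_wide = select_entries(dictionary, sections["Results"], species_patterns)
article_wide = select_entries(dictionary, " ".join(sections.values()), species_patterns)
```

Section-wide selection enables fewer entries per lookup than article-wide selection, which is the source of the efficiency difference between the two contexts.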
BioCreative III participants were given a collection of training data that contains 32 full-text articles annotated by a group of experienced curators invited from various model organism databases. The articles are available in XML from selected journals in PubMed Central. A list of normalized EntrezGene IDs is provided for each article in the set.
Species distribution across data sets
Training Set (32 articles)
Test Set (50 articles)
Test Set (507 articles)
Enterobacter sp.638 (23%)
S.pneumoniae TIGR4 (9%)
S.cerevisiae S228c (6%)
Enterobacter sp.638 (4%)
M.oryzae 70-15 (4%)
S.pneumoniae TIGR4 (2%)
E.histolytica HM-1 (2%)
S.scrofa (1%)
Other 18 species (9%)
Other 65 species (23%)
Other 91 species (7%)
To measure the overall retrieval efficacy for several sample queries, the average of the TAP over all queries is adopted.
An E-value threshold E_0 is determined to mirror a user's tolerance for errors. Assume that a user tolerates about k errors per query (EPQ), k being some arbitrary integer. The BioCreative III GN task uses k = 5, 10, 20 as an arbitrary but not unreasonable estimate of a tolerable EPQ. Determine the smallest E-value E_k(A) corresponding to a median number of k EPQ over all queries q for a given system A. Thus, for any E-value threshold larger than E_k(A), at least 50% of the queries have at least k errors. Each system's E-value predicts the actual number of EPQ with varying accuracy, so the threshold E_k(A) depends on the algorithm A. With the same median k EPQ, all algorithms have the same specificity. With their specificities fixed at the same value, their sensitivities are on an equal footing, and therefore comparable. In summary, BioCreative III’s measure of overall GN efficacy is the (query-averaged) TAP-k for a median k EPQ (the ‘TAP-k’), i.e. it is the average over all queries of Equation (1) with E_0 = E_k(A).
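The threshold selection step can be illustrated with a short sketch. It covers only the choice of E_k(A), not the TAP formula itself, and it rests on assumptions: each query is a list of (E-value, is_correct) pairs, a record counts as retrieved when its E-value is at most the threshold, and "at least 50% of the queries have at least k errors" is implemented with the upper median. The function names are hypothetical.

```python
from statistics import median_high


def errors_at(query, threshold):
    """Count incorrect records retrieved at or below the E-value threshold."""
    return sum(1 for e_value, correct in query if e_value <= threshold and not correct)


def threshold_for_k(queries, k):
    """Return the smallest observed E-value at which the median number of
    errors per query reaches k, or None if it never does."""
    candidates = sorted({e_value for query in queries for e_value, _ in query})
    for threshold in candidates:
        errors = [errors_at(query, threshold) for query in queries]
        if median_high(errors) >= k:
            return threshold
    return None


# Three toy queries; each record is (E-value, is_correct).
queries = [
    [(0.1, True), (0.2, False), (0.3, False)],
    [(0.1, False), (0.4, True), (0.5, False)],
    [(0.2, True), (0.6, False)],
]
e_1 = threshold_for_k(queries, 1)
e_2 = threshold_for_k(queries, 2)
```

Because the threshold is derived from each system's own error profile, two systems evaluated at the same k are compared at the same median number of errors per query.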
Table 5. Our strategies vs. BioCreative III participant average on the Gold 50 test set (rows include the Optimal Dynamic Dictionary configuration)
Table 6. BioCreative III average (baseline) vs. static vs. section-wide vs. BioCreative III top systems on the silver test sets (Silver 50 and Silver 507)
In the first and the second rows of Table 5, we compare the scores of our static strategy, which uses only the most common species, to the average scores of the BioCreative III participants on the test set Gold 50. Our static strategy, which is our overall best performer on BioCreative III, exceeds the BioCreative III average by at least 24% in every evaluation. According to the BioCreative III GN task overview paper, our static strategy consistently remains in the top tier group in all evaluations.
The first and second rows of Table 6 show the results of the same configurations on the silver test set. Comparing these results with Table 5, we observe that the static strategy's margins over the BioCreative III average in the Gold 50 test set (24%-35%) are noticeably smaller than those in the Silver 507 test set (40%-50%). We believe this is because the majority of the most frequent species in the Silver 507 are among the 22 most common species in UniProt. On the other hand, only two of the top-10 species in the Gold 50 test set are among UniProt’s 22 most common species. This inspired us to try dynamic strategies to select relevant dictionary entries for context-specific normalization.
Rows 3-5 of Table 5 show the results of different strategies employed on the Gold 50. The first and second configurations (rows 3 and 4 of Table 5) dynamically enable dictionary entries based on whole-article or section context, respectively. Lastly, we show the best performance that could ideally be achieved by using a dynamic strategy with our GN system (row 5 of Table 5). We construct the ideal system as follows: for each article A, we find the species mentioned in A by checking each ID’s species in the gold standard ID list corresponding to A. For example, gene ID 10211 is found in article PMC2858709’s gold standard ID list. Gene ID 10211 belongs to Taxonomy ID 9606. Therefore, we know that this article mentions Taxonomy ID 9606.
As we can see in Table 5, both dynamic-strategy configurations increase all TAP-k scores by similar margins. In TAP-5, TAP-10, and TAP-20, they outperform the static baseline by about 17%, 13% and 6%, respectively. As k increases, the improvement margin of the dynamic over the static strategy decreases. This may indicate that more IDs can be correctly normalized at the beginning of the returned gene list after including the dictionary entries belonging to the context species in addition to the top-22 most common species. Take article PMC2887456 for example. Nad7 (ID:3800099) and EXPB11 (ID:778389) cannot be normalized because their species (Triticum aestivum, Taxonomy ID:4565) is not included in the top-22. However, using a dynamic strategy, the gene names corresponding to Triticum aestivum are included, and these two genes are correctly normalized and ranked 3rd and 8th. The TAP-5 score for this article is improved by 0.1944. The dynamic strategy can identify those gene IDs whose context information is rich but whose corresponding species is uncommon. If their dictionary entries are included, they can usually be correctly ranked near the front of the list, which affects TAP-5 more than TAP-10 or TAP-20 and explains why, as the k value increases, the advantage of a dynamic over a static strategy decreases.
As mentioned above, article-wide and section-wide contexts achieve very similar TAP-k scores. Consider the average normalization ambiguity in the test set: when using article-wide context, one gene name matches 2.5 IDs, while when using section-wide context, one gene name matches 1.7 IDs on average. When normalizing every occurrence of one gene in a given article using article-wide and section-wide contexts, 260,412 and 107,205 dictionary entries are enabled on average, respectively. Obviously, using section-wide context is more efficient.
Row 5 of Table 5 shows that the optimal dynamic strategy outperforms the proposed dynamic strategies by a significant margin. This implies that our GN system’s TAP score could be further improved with a better species identification system.
The dynamic strategies are also effective on the silver test set. In Table 6, we can see that using the section-wide dynamic strategy, our GN system outperforms the best BioCreative III system in TAP-10 on the Silver 50 test set and in TAP-5 on the Silver 507 set. As previously reported, using the silver standard allows GN developers to assess systems on the entire set of test articles without human annotation. This increases our confidence in the superiority of our proposed dynamic strategies over our original static strategy and most BioCreative III participating systems.
After analyzing our dynamic-strategy system’s results on the gold standard 50 dataset, we found that gene mentions belonging to rare species are often incorrectly associated with IDs belonging to popular species (such as human and rat). This is because our disambiguation process boosts the scores of IDs whose species information is found in the context. Since popular species’ keywords appear more frequently than those of rare species, IDs belonging to popular species are more likely to be selected.
Another problem is caused by the inability of our GMR system to recognize genes belonging to rare species with distinct nomenclature (naming rules). We may be able to improve GMR in this regard by first generating pattern-based rules from names of more popular species using a local alignment algorithm such as Smith-Waterman.
With recent advances in text-mining technology and increasing availability of full-text articles online, text mining can be carried out on full papers rather than just abstracts to expand and enrich automated literature curation. After an article has been selected for curation, a preliminary step is to list genes or proteins of interest in the article. While the concept is very simple, the task is very difficult to automate. In this paper, we present a multi-stage GN algorithm and SVM-based ranking method that we submitted to BioCreative III GN. We make use of the different characteristics of each paper section in our GN system. In addition, we propose two types of strategies for selecting dictionary entries for GN.
We have demonstrated that the static strategy that we submitted to BioCreative III, which uses only the most common species, exceeds the BioCreative III average by at least 24% in every evaluation. Examining this strategy’s much poorer performance in the Gold 50 test set, we noticed that most false negative IDs were of rare species. To improve identification of such species, we decided to try dynamic strategies to select relevant entries from the dictionary according to article-wide or section-wide species context. Our new approaches improved TAP-k scores by up to 17% in the Gold 50 test set. Our best dynamic strategy achieves comparable performance to the best BioCreative III systems in the silver-standard evaluation sets. These results demonstrate the superiority of our proposed dynamic strategies over our original static strategy and most BioCreative III participant systems. Section-wide dynamic strategy is preferred because it achieves very similar TAP-k scores to article-wide dynamic strategy but is more efficient. Comparison of our best results with an optimal configuration for which all species were verified manually shows that our GN system’s TAP score could be further improved with a better context-based species identification module.
This research was supported in part by the National Science Council under grant NSC 98-2221-E-155-060-MY3. We especially thank the BioCreative organizers and BMC reviewers for their valuable comments, which helped us improve the quality of the paper.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 8, 2011: The Third BioCreative – Critical Assessment of Information Extraction in Biology Challenge. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S8.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.