BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature
© Kuo et al. 2009
Published: 3 December 2009
To automatically process large quantities of biological literature for knowledge discovery and information curation, text mining tools are becoming essential. Abbreviation recognition is related to named entity recognition (NER) and can be considered a pair recognition task: finding a term and its corresponding abbreviation in free text. Successfully identifying an abbreviation and its corresponding definition is not only a prerequisite for indexing the terms of text databases so that articles of related interest can be retrieved, but also a building block for improving existing gene mention tagging and gene normalization tools.
Our approach to abbreviation recognition (AR) is based on machine learning and exploits a novel set of rich features to learn rules from training data. Tested on the AB3P corpus, our system achieved an F-score of 89.90% with 95.86% precision at 84.64% recall, higher than the result achieved by the best-performing existing AR system. We also annotated a new corpus of 1200 PubMed abstracts derived from the BioCreative II gene normalization corpus. On our annotated corpus, our system achieved an F-score of 86.20% with 93.52% precision at 79.95% recall, which also outperforms all tested systems.
By applying our system to extract all short form-long form pairs from all available PubMed abstracts, we have constructed BIOADI. Mining BIOADI reveals many interesting trends in biomedical research. We also provide off-line AR software in the download section of http://bioagent.iis.sinica.edu.tw/BIOADI/.
Protein/gene name recognition (NR) [1, 2] is one of the most challenging tasks in biomedical text mining. Solving the problem of NR will allow more complex text mining tasks to be addressed, as it is a prerequisite for information extraction and advanced text mining [3, 5, 6]. One of the main reasons it is challenging is the high variation of terms that are not explicitly reflected in biomedical ontologies. Biological entities commonly have several names; for example, PTEN and MMAC1 refer to the same entity. It was estimated that one-third of biological terms are variants.
Important studies in this area include GAPSCORE, which examines the appearance, morphology and context of named entities before applying a classifier trained on these features (59% precision and 50% recall). ABNER employed a conditional random field model and achieved precision between 58.2% and 85.4% and recall between 53.9% and 79.8% for different target entities. Other groups have attempted combinations of approaches to improve precision [12–16].
Abbreviation recognition (AR) is related to NR and can be considered a pair recognition task: finding a term (which may be a phrase or an entity) and its corresponding abbreviation in free text. In this manuscript, "LF" denotes the long form of a term and "SF" denotes its abbreviation, or short form. Since most protein and gene names are rather lengthy, researchers tend to abbreviate them in published manuscripts. As a result, AR can serve as a precursor to a number of applications, for example building a term index of a text database to retrieve articles of related interest or linking text-mined protein interaction networks [18–20]. Hence, it seems plausible to use AR as a first pass in NER. In the simplest sense, AR may be used to help determine the boundaries of entity names in free text, as reported in [21, 22].
AR is generally considered a simpler problem than NER, as shown by the performance of AR systems. For example, Stanford University's Abbreviation Server [23, 24] demonstrated 97% precision at 22% recall and 95% precision at 75% recall. AbbRE and the system by Schwartz et al. achieved 96% precision with 70% recall and 96% precision with 82% recall, respectively, while the SaRAD system reported 95% precision with 85% recall. More recently, Sohn et al. used an LF-to-SF matching algorithm similar to that of Yu et al. and reported 96.5% precision with 83.2% recall. However, these performance measures are hardly comparable because each system was tested on a different corpus. Although both Chang et al. and Schwartz et al. used the Medstract Gold Standard Evaluation Corpus, each made undisclosed modifications to the test corpus, which makes comparison difficult. Nevertheless, Torii et al. performed a meta-study comparing the results of a number of AR systems and found that the SF-LF pairs identified by each system are generally consistent with previous reports. In general, these systems achieve excellent precision but still have plenty of room for improvement in recall. Currently, the systems of Schwartz et al. and Sohn et al. demonstrate the best AR performance among existing systems.
Schwartz et al. used a two-step algorithm for AR under the assumption that the SF and LF must occur in the same sentence. In the first step, identification of a possible SF-LF pair is triggered by the presence of a pair of brackets. Two cases are considered: the LF is in the brackets or the SF is in the brackets. If it is likely that the SF is in the brackets, the second step searches for the LF word boundaries in the sentence using morphological features. Sohn et al. also used brackets to initiate AR but ignored a list of common bracket-delimited structures, such as "(p < 0.05)". The potential SF-LF pairs are then filtered using a set of pre-defined rules.
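To make the second step of such a bracket-driven approach more concrete, the following is a minimal Python sketch of a right-to-left character-matching search for the LF boundary. It is an assumption about what that step involves, modelled on the matching routine commonly associated with Schwartz's published algorithm rather than taken from this paper; the function name and its arguments are hypothetical.

```python
def find_best_long_form(short_form, text_before):
    """Return the shortest suffix of text_before whose characters cover
    short_form when matched right to left, or None if no match exists."""
    l_index = len(text_before) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_index].lower()
        if not ch.isalnum():
            continue  # punctuation in the SF is skipped
        # Move left until the character matches; the first SF character
        # must additionally sit at the start of a word in the LF.
        while (l_index >= 0 and text_before[l_index].lower() != ch) or (
            s_index == 0 and l_index > 0 and text_before[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    # Extend to the beginning of the word containing the last match.
    start = text_before.rfind(" ", 0, l_index + 1) + 1
    return text_before[start:]
```

For the text "the heat shock protein" followed by "(HSP)", the sketch returns "heat shock protein"; for an SF whose characters cannot be matched, it returns `None`.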
Our approach to AR is based on machine learning and exploits a novel set of rich features to describe the properties of a potential SF-LF pair. In addition, unlike the systems of [26, 28], ours can identify pairs with unused characters in the SF, for example "CA5" and "CA V gene". Our system also outputs a prediction probability to indicate the confidence of each identified SF-LF pair. Tested on the AB3P corpus, our system achieved an F-score of 89.90% with 95.86% precision at 84.64% recall. We also annotated a corpus of 1200 PubMed abstracts derived from the BioCreative II gene normalization dataset. On our corpus, our system achieved an F-score of 86.20% with 93.52% precision at 79.95% recall. Compared with existing available AR systems [26, 28], our system outperformed them on both corpora and runs about 14 times faster than the best-performing AR system. All resources can be found on our website. By applying our system to extract all short form-long form pairs from all available PubMed abstracts, we have constructed BIOADI, the most comprehensive dictionary of biological abbreviations online. Mining BIOADI reveals many interesting trends in biomedical research.
LF is in front of SF, and SF is in brackets or square brackets, e.g. "heat shock protein (HSP)";
SF is in front of LF, and LF is in brackets or square brackets, e.g. "HSP (heat shock protein)";
Both SF and LF are in brackets or square brackets and separated by a comma or semi-colon, e.g. "(HSP, heat shock protein)".
SF-LF pairs adhering to one of these forms were annotated as potential SF-LF pairs. The BIOADI corpus includes 1668 true SF-LF pairs and 145 synonym pairs, which are marked with "//" at the beginning of each pair. The synonym pairs were not considered valid SF-LF pairs and were ignored in the following experiments. We also used the AB3P corpus for performance evaluation. It contains 1221 true SF-LF pairs, although some of them are synonym pairs.
Both positive and negative instances were required for model training. In this study, annotated SF-LF pairs were used as positive instances in the training data, and negative instances were automatically extracted from text. The extraction of potential SF-LF pairs was similar to that in previous work. However, constraints on the character or word length of SFs were not set, but numbered list indicators (e.g., (a), (b), (1a), (1b), (I), (II), ...) and common strings ("e.g.", "and", ...) were filtered out. Potential SFs that do not contain any alphabetic character, or that contain certain symbols ("=", "%", ">" and "<"), were excluded. A potential LF can be composed of up to ten consecutive words preceding a potential SF in the same sentence, or can appear in brackets or square brackets following a potential SF. There are therefore at most ten potential LFs for a potential SF, of which at most one is correct. Each abstract was split into sentences by "sentence and paragraph breaker" before the automatic AR process. All potential SF-LF pairs were checked against the list of positive instances; pairs not found there served as negative instances in the training data.
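The candidate generation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the regular expressions, the whitespace tokenization and the short filter lists are hypothetical, and only the case of a bracketed SF following its LF is covered.

```python
import re

MAX_LF_WORDS = 10                  # up to ten consecutive words before the SF
BANNED_SYMBOLS = set("=%<>")       # symbols that disqualify a potential SF
COMMON_STRINGS = {"e.g.", "and"}   # illustrative subset of common strings
# Hypothetical pattern for list indicators such as (a), (1b), (I), (II)
LIST_INDICATOR = re.compile(r"^(?:\d?[a-z]|[IVX]+)$")
BRACKETED = re.compile(r"[\(\[]([^()\[\]]+)[\)\]]")


def is_potential_sf(text):
    """Apply the SF filters: needs a letter, no banned symbol, not a list marker."""
    if not any(ch.isalpha() for ch in text):
        return False
    if any(ch in BANNED_SYMBOLS for ch in text):
        return False
    if text in COMMON_STRINGS or LIST_INDICATOR.match(text):
        return False
    return True


def candidate_pairs(sentence):
    """Pair every bracketed potential SF with the 1..10 words preceding it."""
    pairs = []
    for match in BRACKETED.finditer(sentence):
        sf = match.group(1).strip()
        if not is_potential_sf(sf):
            continue
        words = sentence[:match.start()].split()
        for n in range(1, min(MAX_LF_WORDS, len(words)) + 1):
            pairs.append((sf, " ".join(words[-n:])))
    return pairs


def label_pairs(pairs, positives):
    """Candidates found among the annotated pairs are positive, the rest negative."""
    return [(sf, lf, (sf, lf) in positives) for sf, lf in pairs]
```

Applied to a sentence such as "We studied heat shock protein (HSP) levels (p < 0.05).", the sketch yields five candidates for "HSP" and discards the bracketed p-value.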
Before training and testing the model, each pair must be transformed into a feature vector. To construct features from the raw data (the potential SF-LF pairs extracted in the previous step), we defined four sets of features. The design of these features originated from earlier work, was inspired by previous works [29, 34] and was refined through our tests. The details are as follows:
String morphological features of SF and LF
Is the first letter of the string uppercase?
Is the first letter of the string lowercase?
Are all characters of the string uppercase?
Are all characters of the string lowercase?
Does the first word of the LF use the first letter of the SF (case-sensitive and insensitive)?
Is the first word of the LF a stop word (case-insensitive)?
Is the first word of the SF a stop word (case-sensitive)?
Does the string contain numbers?
Does the LF share the same numbers as the SF?
Does the string contain Greek letters?
Does the LF start with the SF?
Do the brackets or square brackets of the string pair up correctly?
Is the number of stop words in the LF = 1, 2, 3, 4...?
Is the length (in tokens) of the string = 1, 2, 3, 4...?
Does the string contain certain punctuation symbols?
The character pattern of the SF: first, each run of consecutive uppercase or lowercase characters is converted to "A" or "a", respectively. Second, each run of consecutive digits is converted to "1". Third, all other characters are pruned. The converted string is then matched against a specified pattern. For example, the SF "Rb1" matches the pattern "Aa1".
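The character-pattern conversion can be illustrated with a short Python sketch. The function name is hypothetical, and restricting the conversion to ASCII letters and digits is an assumption made here for simplicity.

```python
import re


def char_pattern(sf):
    """Collapse runs of uppercase, lowercase and digit characters into
    'A', 'a' and '1'; every other character is dropped."""
    pattern = []
    for run in re.finditer(r"[A-Z]+|[a-z]+|[0-9]+", sf):
        first = run.group(0)[0]
        if first.isdigit():
            pattern.append("1")
        elif first.isupper():
            pattern.append("A")
        else:
            pattern.append("a")
    return "".join(pattern)
```

With this sketch, "Rb1" becomes "Aa1" as in the example above, while "CA5" and "NGL-1" both become "A1".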
We used spaces and punctuation as delimiters to tokenize each potential LF. Each token acted as a binary feature representing token information of the potential LF. We also applied token bi-grams as binary contextual features of the potential LF.
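As a rough illustration of these token features, the sketch below builds the set of active binary features for one LF. The feature-name prefixes ("tok=", "bi=") and the exact delimiter set are assumptions; the source does not specify them.

```python
import re


def lf_token_features(lf):
    """Return the names of the binary features that fire for this LF:
    one per token and one per adjacent token pair (bi-gram)."""
    # Split on anything that is not a letter or digit (spaces, punctuation).
    tokens = [t for t in re.split(r"[\W_]+", lf) if t]
    features = {f"tok={t}" for t in tokens}
    features |= {f"bi={a}_{b}" for a, b in zip(tokens, tokens[1:])}
    return features
```

A feature present in the returned set has value 1 for that pair and every other feature has value 0, which is how a sparse binary feature vector is usually represented.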
Numeric features between SF and LF
1. The length (in characters) of the longest common subsequence of the SF and the LF, divided by the SF length (in characters);
2. Same as 1, but with the LF replaced by the string consisting of the first character of each LF token (e.g. "protein kinase C" forms "PKC");
3. The size of the character set shared by the SF and the LF, divided by the size of the character set of the SF;
4. The size of the character set of the SF divided by the SF length (in characters);
5. Whether the shortest LF extracted for the SF-LF pair by Schwartz's AR system is equal to the LF;
6. Same as 5, but ignoring numbers in the SF and LF (e.g. "CA 5 gene" is transformed into "CA gene");
7. Same as 5, but reversing both the SF and LF (e.g. "CA 5 gene" is transformed into "eneg 5 AC");
8. Same as 7, but ignoring numbers in the SF and LF (e.g. "CA 5 gene" is transformed into "eneg AC").
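The first four numeric features in this list can be computed directly from the two strings, as the following Python sketch shows. It is illustrative only: the function and key names are hypothetical, comparing characters case-insensitively is an assumption, and a non-empty SF is assumed. The remaining features depend on the output of Schwartz's AR system and are not reproduced here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def numeric_features(sf, lf):
    """Ratios relating the characters of the SF to those of the LF."""
    sf_l, lf_l = sf.lower(), lf.lower()          # assumption: case-insensitive
    initials = "".join(tok[0] for tok in lf_l.split())
    sf_chars, lf_chars = set(sf_l), set(lf_l)
    return {
        "lcs_ratio": lcs_length(sf_l, lf_l) / len(sf_l),
        "initials_lcs_ratio": lcs_length(sf_l, initials) / len(sf_l),
        "shared_charset_ratio": len(sf_chars & lf_chars) / len(sf_chars),
        "charset_ratio": len(sf_chars) / len(sf_l),
    }
```

For "PKC" and "protein kinase C" all four ratios equal 1.0, whereas for "CA5" and "CA V gene" the unused character "5" lowers the first and third ratios to 2/3.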
Contextual features of SF-LF pair
We generated contextual information for each potential SF-LF pair from the tokens that precede the pair, limited to at most two tokens. Each of those tokens acted as a binary feature.
We also applied token bi-grams as binary contextual features of the SF-LF pair.
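A minimal sketch of these contextual features follows. It assumes that the character offset at which the pair starts in the sentence is known; the function name, the `window` parameter and the feature-name prefixes are hypothetical.

```python
import re


def context_features(sentence, pair_start, window=2):
    """Binary features from up to `window` tokens preceding the SF-LF pair,
    plus the bi-gram formed by those tokens."""
    preceding = [t for t in re.split(r"[\W_]+", sentence[:pair_start]) if t]
    context = preceding[-window:]
    features = {f"ctx={t}" for t in context}
    features |= {f"ctx_bi={a}_{b}" for a, b in zip(context, context[1:])}
    return features
```

When the pair opens the sentence there are no preceding tokens and the function returns an empty set.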
Number of features of each feature set and total number of all features (M + L + N + C) generated in feature extraction
To test the performance of different learning algorithms on our feature set, we implemented four learning algorithms: Support Vector Machine (SVM), Naïve Bayes, Logistic Regression and Monte-Carlo Sampling Logistic Regression. We used MALLET to implement Naïve Bayes, Logistic Regression and Monte-Carlo Sampling Logistic Regression, and LIBSVM for SVM. LIBSVM was incorporated into MALLET to simplify the experimental pipeline across learning algorithms.
If the length of the SF (in characters) is equal to one, the length of the LF (in words) must not be larger than one;
If the SF is equal to "s", the first letter of the LF must not be "S" or "s" (e.g. "substrate(s)");
The brackets and parentheses of the SF and LF must pair up correctly;
The LF cannot contain a semi-colon followed by a space;
The number of punctuation marks in the SF divided by the length of the SF (in characters) must not be larger than 0.5;
There must be at most two pairs of brackets or parentheses;
The LF must not start with the SF;
The SF must not be a sequence or list indicator (e.g., (a), (b), (1a), (1b), (I), (II), ...).
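The constraints listed above can be combined into a single predicate, sketched below in Python. Several details are assumptions rather than facts from the source: the names are hypothetical, the list-indicator pattern is a guess at what such indicators look like, bracket pairs are counted over the SF and LF together, punctuation means ASCII punctuation, and the "starts with" test is case-sensitive.

```python
import re
import string

# Hypothetical pattern for indicators such as a, b, 1a, 1b, I, II
LIST_INDICATOR = re.compile(r"^(?:\d?[a-z]|[IVX]+)$")


def brackets_balanced(text):
    """True if round and square brackets in text are properly nested."""
    closers = {")": "(", "]": "["}
    stack = []
    for ch in text:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


def passes_rules(sf, lf):
    """Return True if a non-empty SF and its LF satisfy all eight constraints."""
    if len(sf) == 1 and len(lf.split()) > 1:
        return False                                  # one-letter SF, multi-word LF
    if sf == "s" and lf[:1] in ("S", "s"):
        return False                                  # plural marker, e.g. "substrate(s)"
    if not (brackets_balanced(sf) and brackets_balanced(lf)):
        return False
    if "; " in lf:
        return False
    if sum(ch in string.punctuation for ch in sf) / len(sf) > 0.5:
        return False
    if (sf + lf).count("(") + (sf + lf).count("[") > 2:
        return False                                  # more than two bracket pairs
    if lf.startswith(sf):
        return False
    if LIST_INDICATOR.match(sf):
        return False
    return True
```

Because the balance check runs before the pair count, counting opening brackets is enough to count pairs at that point.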
Since Sohn's and Schwartz's AR systems are available online, we were able to run them at our local site. We used them without modification throughout system evaluation and comparison, except for a necessary change to the input and output of Schwartz's system to handle the format of the AB3P corpus.
In this study, we used a machine learning approach to SF-LF pair recognition instead of a rule-based approach [26–28, 38]. Four learning algorithms, Logistic Regression, Monte-Carlo Sampling Maximum Entropy, Support Vector Machine and Naïve Bayes, were tested.
Performance of various learning algorithms, including SVM with linear and RBF kernels, tested on the BIOADI corpus and the AB3P corpus
Performance of logistic regression classifier trained with different feature sets (M + L, M + L + N, and M + L + N + C) and tested on the BIOADI corpus and the AB3P corpus
We compared our system to Schwartz's and Sohn's systems. Each system was trained with the AB3P corpus before being tested on the BIOADI corpus, and vice versa. The Medstract Gold Standard Evaluation Corpus was not used for evaluation, because past results reported on it are all based on different modified versions annotated by each team.
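The F-scores quoted in this comparison are consistent with the balanced harmonic mean of precision and recall. The source does not state the formula explicitly, so treating the F-score as the balanced F1 measure is an assumption; the short check below, with a hypothetical function name, reproduces the two reported values from the reported precision and recall.

```python
def f_score(precision, recall):
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall (in percent) on the two corpora
ab3p = f_score(95.86, 84.64)     # reported F-score: 89.90
bioadi = f_score(93.52, 79.95)   # reported F-score: 86.20
print(f"AB3P: {ab3p:.2f}  BIOADI: {bioadi:.2f}")
```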
Performance of the AR systems tested on the BIOADI corpus and the AB3P corpus
Figure 1 shows the results of the three AR systems tested on the BIOADI corpus. In the plot of precision versus training data size, the precision of our system is initially below that of Schwartz's and Sohn's systems and fluctuates strongly. However, the fluctuation narrows as the training size increases, and the curve approaches the precision of Schwartz's system at 94%. There is little difference in precision among these systems. In the plot of recall versus training data size, the recall of our system is higher than that of the other systems even when it is trained on a small amount of data. The range of oscillation again decreases as the training data grows. The recall in the last iteration (trained on 600 abstracts) was three percent higher than Sohn's system and four percent higher than Schwartz's system. We observed the same trend in the plot of F-score versus training data size. At first, performance differed little among the systems; with more training data, our system shows the advantage of a machine learning-based AR system by improving its performance.
Figure 2 shows the same experiment but using the AB3P corpus for testing. In this case, our system also performed better in recall and F-score. Unexpectedly, and unlike the results on the BIOADI corpus, our precision here was better than that of Schwartz's system. This indicates that training with a consistently annotated corpus (i.e., the BIOADI corpus) is useful for improving AR performance.
In addition, to test the influence of data size in a large set of training data, we merged both corpora to form a new dataset which contains 2450 unique abstracts. Figure 3 shows that the trends in precision, recall and F-score were similar to the results in Figure 1 and Figure 2. As the data size increases, it is expected that our machine learning-based approach will continue to improve while no improvement will be observed for rule-based systems.
Testing time (in seconds) of three AR systems on sets of PubMed abstracts of different sizes
"PPIs|pump inhibitors" rather than "PPIs|proton pump inhibitors"
"CCR5|chemokine receptor 5" rather than "CCR5|C-C chemokine receptor 5"
Other common types of pairs missed by our system include out-of-order matches (e.g., NGL-1|netrin-G1 ligand) and partial matches (e.g., Pol II|RNA polymerase II). Another type of false-negative pair has been reported [26, 28]: pairs with unused characters in the SF. Most of them have been removed by those systems during AR to reduce false positives. However, we kept them and identified them correctly through model prediction. For example, our system can identify:
"CA5|CA V gene"
"FTH1|ferritin heavy-chain gene"
Top 20 most frequent SF-LF pairs extracted from 17,551,165 PubMed abstracts
polymerase chain reaction
human immunodeficiency virus
magnetic resonance imaging
central nervous system
body mass index
protein kinase C
enzyme-linked immunosorbent assay
hepatitis C virus
Top 5 long-form occurrences for the abbreviation "APC"
adenomatous polyposis coli
activated protein C
argon plasma coagulation
Our web site http://bioagent.iis.sinica.edu.tw/BIOADI/ provides free access to two online services and two off-line tools. The online services are (1) the SF-LF Search Service and (2) the SF-LF Identification Service; the off-line tools are (1) an off-line abbreviation recognition tool and (2) an abstract fetching script. The SF-LF Search Service helps users quickly retrieve all of the SF-LF pairs in our database. Query results are listed 20 records per page and ordered by the number of PubMed IDs for each pair, so that users can easily find the most popular ones. To see which PubMed IDs contain an SF-LF pair, users can click on the document picture under the "PubMed" column to generate a "PubMed ID box." For SF-LF pairs with too many PubMed IDs to be fully displayed in the box, users can click on "PubMed Resource" to see the whole list of PubMed IDs. With the search service, users can find different subtypes of SFs or LFs and thereby come upon additional PubMed IDs that they cannot find through a regular literature search. The SF-LF Identification Service provides real-time AR. Users can enter text such as abstracts or manuscripts in the text area, or enter a PubMed ID in a separate input box. After receiving the input, the system returns the identified SF-LF pairs and a score for each pair. The higher the score, the more the identification can be trusted. If the input is a PubMed ID, the result table also shows a hyperlink to PubMed at the bottom. In the download section, we provide all of the SF-LF pairs in our database together with the two tools: (1) the Off-line Abbreviation Recognition Tool and (2) the Abstract Fetching Script. The off-line abbreviation recognition tool automatically performs AR on a given text file and writes the SF-LF pairs and their scores to an output file. Since it is a Java application, it can run on any platform and serve as an SF-LF identification component in any analysis pipeline. The abstract fetching script is a Perl script that can be used to download abstracts in bulk for given PubMed IDs through the Web Service of the PubMed database. To ensure the stability of our web site, all scripts and the layout of the web site have been tested on different browsers, different platforms, and even mobile devices.
Our system demonstrated 93.5% precision and 80.0% recall, giving an F-score of 86.2%, which statistically significantly outperformed the best-performing existing AR system. At the same time, our system runs sufficiently fast to handle the entire set of PubMed abstracts. This suggests that a machine learning approach to abbreviation recognition delivers not only performance at least as good as that of a rule-based system, but also satisfactory execution speed.
This work is supported in part by the National Research Program in Genomic Medicine (NRPGM), NSC, Taiwan, under Grant No. NSC97-3112-B-001-024 (Advanced Bioinformatics Core).
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 15, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S15.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.