Exploiting and integrating rich features for biological literature classification
© Wang et al.; licensee BioMed Central Ltd. 2008
Published: 11 April 2008
Efficient features play an important role in automated text classification, which facilitates access to large-scale data. In the bioscience field, biological structures and terminologies are described by a large number of features, and domain dependent features can significantly improve classification performance. How to effectively select and integrate different types of features to improve biological literature classification is the major issue studied in this paper.
To classify the biological literature efficiently, we propose a novel feature value schema, TF*ML; features ranging from the lower level, domain independent “string feature” to the higher level, domain dependent “semantic template feature”; and proper integrations among these features. Compared to our previous approaches, the performance is improved in terms of AUC and F-Score by 11.5% and 8.8% respectively, and it exceeds the best performance achieved in BioCreAtIvE 2006.
Different types of features possess different discriminative capabilities in literature classification; proper integration of domain independent and dependent features would significantly improve the performance and overcome the over-fitting on data distribution.
In general text classification, effective features are essential to make the learning task more efficient and accurate. No degree of classifier sophistication can make up for a lack of predictive information in the input features. In bioscientific literature, where biological structures and terminologies are described by a large number of features, the situation is more serious: well-chosen features can improve the classification accuracy substantially and decrease the risk of over-fitting.
In the early days of biological literature classification, most researchers depended on domain experts to pick out the informative features. Regev et al. used expert-defined rules to extract features from semi-structured text and figure legends. In addition, they utilized external lexical resources and semantic constraints to achieve better coverage and accuracy. Min Shi et al. employed two types of keywords as features: one type came from the given evidence, and the other was manually extracted from the training texts by domain experts. Moustafa M. Ghanem et al. utilized expert-edited regular expressions to capture frequently occurring keyword combinations (or motifs) within short segments of the text in a document. All these approaches require the involvement of domain experts in identifying the specific textual objects and the informative templates, so they cannot easily be extended automatically into an efficient and scale-free model for other biological datasets.
In recent years, fully automatic and scalable text classification algorithms have provided an alternative to the previous methods. Wilbur employed unigrams, bigrams and all of the MeSH terms as the feature set to represent the documents. Dobrokhotov et al. utilized the words processed by the XEROX natural language processing tool as discriminating attributes. Aaron et al. used the “Bag of Words” model: content was tokenized and stemmed into unigram features, and the samples were modelled as binary feature vectors.
Although all of these features capture some aspects of biological and statistical meaning, they still cannot automatically exploit the domain dependent information in the complex biological literature. It has therefore become a challenge in the biological text mining field to automatically introduce higher level domain dependent features into the classification process and to integrate them with the lower level domain independent features.
In this paper, we investigate the issue of biological literature classification from the perspective of feature selection and integration, and evaluate it on BioCreAtIvE, an international evaluation in biological text mining. In the IAS (Protein Interaction Article Sub-task) of BioCreAtIvE 2006, participants were asked to classify a given set of MEDLINE titles and abstracts according to whether or not a document contains at least one physical PPI (Protein-Protein Interaction). This procedure is extremely useful for improving the efficiency of manual curation, since it largely filters out the irrelevant documents. In the evaluation, one of our implemented classifiers achieved outstanding results: its Accuracy ranked 1st, and its AUC and F-Score both ranked 2nd.
where x ranges over the word and phrase features employed in IAS, and P(x) and Q(x) are the probabilities of x in the training and testing sets respectively.
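As an illustration of how such a divergence between two document sets could be computed, the following minimal sketch uses the standard Kullback-Leibler definition, D(P||Q) = Σ P(x) log(P(x)/Q(x)). The add-one smoothing, the toy documents and the function names are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter

def feature_distribution(docs, vocab, smoothing=1.0):
    """Estimate a smoothed probability distribution over the features in vocab.

    ASSUMPTION: add-one (Laplace) smoothing keeps every probability non-zero,
    so that the logarithm in the divergence is always defined.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    vocab = sorted(vocab)
    total = sum(counts[t] for t in vocab) + smoothing * len(vocab)
    return {t: (counts[t] + smoothing) / total for t in vocab}

def kl_divergence(p, q):
    """Standard KL divergence D(P || Q) = sum_x P(x) * log(P(x) / Q(x))."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

# Hypothetical toy data: each document is a list of word features.
train_docs = [["protein", "bind", "protein"], ["cell", "growth"]]
test_docs = [["protein", "interact"], ["cell", "cell", "growth"]]
vocab = {tok for doc in train_docs + test_docs for tok in doc}

P = feature_distribution(train_docs, vocab)
Q = feature_distribution(test_docs, vocab)
print(kl_divergence(P, Q))
```

A value of zero means the two sets have identical feature distributions; larger values indicate a larger shift between them.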
KL Divergence on Training, Cross Validation and Testing Set
Training Set Vs Cross Validation Set
Training Set Vs Testing Set
The rest of the paper is organized as follows. The Methods section describes the proposed methodologies in detail. The Results and discussion section presents the experiment results and analysis. The Conclusion section summarizes our contributions.
In this paper, we investigate the issue from the perspective of feature selection and integration. The main contributions are: 1) the domain independent feature value schema TF*ML and the length-fixed string feature; 2) the domain dependent “semantic template” feature; 3) efficient integrations among the features. These methods are described in turn in the following.
where t denotes the selected feature word, c+ and c− denote the relevant and irrelevant categories, and P(t|c+) and P(t|c−) denote the probabilities that t occurs in categories c+ and c− respectively.
The sign of ML indicates the category relevance of the feature and the magnitude reflects the classification confidence. Following the same idea as TF*IDF to express the specificities of features in different documents, we also multiply TF by ML.
In the ML schema, the relevant/irrelevant document rate is taken into consideration as a compensating factor. In the posterior probability schema, by contrast, the impact of the relevant/irrelevant document rate is eliminated under the independent and identical distribution hypothesis, which is not tenable in our situation because the relevant/irrelevant rate differs between the training and testing sets.
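To make the TF*ML idea concrete, here is a minimal sketch. The defining equation for ML is not reproduced in this text, so the code assumes a log ratio, ML(t) = log(P(t|c+) / P(t|c−)), which is consistent with the statement that the sign gives the category and the magnitude gives the confidence. The smoothing constant, the document-frequency estimate of P(t|c) and all names are likewise assumptions for illustration only.

```python
import math
from collections import Counter

def ml_weights(pos_docs, neg_docs, smoothing=1.0):
    """Compute an ML-style weight for every feature.

    ASSUMPTION: ML(t) = log(P(t|c+) / P(t|c-)), where P(t|c) is the smoothed
    fraction of documents in category c that contain t.
    """
    pos_df = Counter(t for d in pos_docs for t in set(d))
    neg_df = Counter(t for d in neg_docs for t in set(d))
    weights = {}
    for t in set(pos_df) | set(neg_df):
        p_pos = (pos_df[t] + smoothing) / (len(pos_docs) + 2 * smoothing)
        p_neg = (neg_df[t] + smoothing) / (len(neg_docs) + 2 * smoothing)
        weights[t] = math.log(p_pos / p_neg)
    return weights

def tf_ml_vector(doc, weights):
    """Multiply the term frequency of each feature by its ML weight."""
    tf = Counter(doc)
    return {t: tf[t] * weights[t] for t in tf if t in weights}

# Hypothetical toy data: relevant (c+) and irrelevant (c-) documents.
pos_docs = [["protein", "interact", "protein"], ["protein", "bind"]]
neg_docs = [["cell", "growth"], ["protein", "expression"]]

weights = ml_weights(pos_docs, neg_docs)
vec = tf_ml_vector(["protein", "interact", "protein"], weights)
print(vec)
```

Under this assumption, features typical of relevant documents receive positive values and features typical of irrelevant documents receive negative values, unlike TF*IDF, where all values are non-negative.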
TF*ML Feature Value Schema.
Precision, Recall and F-Score measure the classification capability of the model, and AUC (area under the receiver operating characteristic curve) evaluates its ranking capability.
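For readers less familiar with these measures, the following sketch computes them from scratch using their standard definitions. The 0.5 decision threshold and the toy labels and scores are assumptions for this example.

```python
def precision_recall_f(y_true, y_pred):
    """Precision, recall and balanced F-score for the relevant class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

def auc(y_true, scores):
    """AUC as the fraction of (relevant, irrelevant) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy predictions.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # ASSUMPTION: threshold of 0.5

print(precision_recall_f(y_true, y_pred))
print(auc(y_true, scores))
```

The F-score depends on the chosen decision threshold, whereas AUC depends only on how the documents are ordered by score.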
In many text classification applications, it is appealing to treat every document as a string of characters rather than a bag of words. In bioscience especially, the tokenizing and stemming procedure incurs an undesired loss of informative attributes, since many semantically related biomedical terms that share the same stem or morpheme are often not reducible to the same stem. Therefore, we propose to use length-fixed strings directly as features in order to exploit most of the informative segments.
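A minimal sketch of length-fixed string extraction is given below. It slides a window of n characters over the text and counts every substring. The window length n, the lower-casing and the replacement of white space by '_' are assumptions chosen for this illustration.

```python
from collections import Counter

def fixed_length_strings(text, n=5):
    """Count all substrings of exactly n characters in the text.

    ASSUMPTIONS: the text is lower-cased and every run of white space is
    collapsed into a single '_' before the window is applied.
    """
    normalized = "_".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))

feats = fixed_length_strings("A binds B", n=3)
print(feats)
```

Because no tokenizing or stemming is applied, related surface forms such as "binds" and "binding" still share the substring "bin" as a common feature.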
Top 10 Unigram Features and String Features (‘_’ denotes a white space)
Named entities and semantic template features
Both of the methods proposed above are domain independent: they have good generalization capacity and are not limited to the bioscience domain. However, introducing domain dependent features can filter out many false positive samples and further improve the performance. In the biological literature, named entities (words and phrases belonging to certain predefined classes, e.g. protein and gene), such as CDC42 (protein), and semantic templates (co-occurrences of a pre-specified type of relationship between entities of given types), such as “ProteinA interact with ProteinB”, are the most meaningful concepts in PPI documents and preserve the syntactic and semantic structures used to describe protein interactions. We therefore introduce named entities and semantic templates as features to exploit the domain dependent information.
With the help of ABNER, a named entity recognition tool, 5 types of named entities can be identified in a given document: protein, DNA, RNA, cell type and cell line. Since the space of recognized entities is large and sparse, we use only their types as features, which decreases the dimension of the feature space without losing universality.
After the named entities are recognized, semantic templates can be extracted from the documents. We propose a novel template extraction algorithm named KeyBT (Keyword-Based Template extraction algorithm) to extract the semantic templates that describe the interaction patterns among all of the recognized entities.
Compared to the traditional local alignment algorithm, KeyBT operates differently: it first locates statistically significant words as seeds, then expands the seeds iteratively within their context, and finally preserves the most “powerful” templates as the result.
Locate the occurrences of predefined candidate keywords in each sentence; discard the sentences without any keywords; get the initial candidate sentence set S0;
Locate each entity type in S0; discard the sentences without any entities; get the initial candidate template set T0;
Iteratively normalize each template in T0, removing the redundant templates by syntax parsing; get the raw template set T1;
Evaluate the templates in T1; filter out the templates of low quality; get the final template set Tf.
KeyBT depends on Chi-Square statistics to select the most distinctive keywords and also uses ML to determine the category relevance of the keywords. ML is needed because Chi-Square does not distinguish which category a feature is associated with, so a few high quality features of the irrelevant category might be overwhelmed by the large number of features of the relevant category. Chi-Square is used to select a raw candidate keyword list (with a low threshold), and then the top 50 features of each category are preserved according to ML.
where t.pos and t.neg are the positive and negative matching counts of template t in the training set, and β is the parameter tuning the positive/negative matching rate.
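The two-stage keyword selection can be sketched as follows. The Chi-Square statistic uses the standard formula for a 2×2 contingency table. The log-ratio form of ML, the smoothing, the threshold value and all function names are assumptions for this illustration, not the authors' implementation.

```python
import math
from collections import Counter

def chi_square(a, b, c, d):
    """Chi-Square for a 2x2 table.

    a / b: relevant / irrelevant documents containing the term;
    c / d: relevant / irrelevant documents not containing it.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def select_keywords(pos_docs, neg_docs, chi_threshold=1.0, top_k=50):
    """Filter terms by Chi-Square, then keep the top_k terms per category by ML."""
    pos_df = Counter(t for doc in pos_docs for t in set(doc))
    neg_df = Counter(t for doc in neg_docs for t in set(doc))
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    candidates = []
    for t in set(pos_df) | set(neg_df):
        a, b = pos_df[t], neg_df[t]
        if chi_square(a, b, n_pos - a, n_neg - b) >= chi_threshold:
            # ASSUMPTION: ML is a smoothed log ratio of document frequencies.
            ml = math.log(((a + 1) / (n_pos + 2)) / ((b + 1) / (n_neg + 2)))
            candidates.append((ml, t))
    relevant = [t for ml, t in sorted(candidates, reverse=True) if ml > 0][:top_k]
    irrelevant = [t for ml, t in sorted(candidates) if ml < 0][:top_k]
    return relevant, irrelevant

# Hypothetical toy data.
pos_docs = [["protein", "interact"]] * 3
neg_docs = [["protein", "expression"]] * 3
relevant, irrelevant = select_keywords(pos_docs, neg_docs)
print(relevant, irrelevant)
```

In this toy example "protein" occurs equally often in both categories, so it is removed by the Chi-Square filter, while the sign of ML assigns "interact" and "expression" to opposite categories.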
When we get the final templates set Tf, we do not simply depend on the positive/negative matching rate of each template to make the prediction. Instead, we use them to build feature vectors and train a classifier.
<PTN>, <DNA> and <CEL> denote protein, DNA and cell-line entities; E* denotes the occurrence of any words
<PTN> E* <DNA> E* association E* <PTN>
<PTN> E* bind E* <DNA>
<PTN> E* interact E* <PTN>
<PTN> E* colocalize E* <CEL>
<PTN> E* contact E* <DNA> E* <PTN>
Compared with the local alignment algorithm, which depends on post-evaluation to remove meaningless and noisy templates, the potential advantages of the KeyBT algorithm are as follows: 1) KeyBT uses the statistical characteristics of the candidate keywords to remove most of the noise before extraction; 2) KeyBT templates do not need the entity types to be fixed beforehand, so they can capture the distribution of templates in both categories and discriminate between the relevant and irrelevant categories; 3) the heuristic rules applied to the relation between named entities and candidate words (such as their order, the average template length and the types of distinct entities) help to guarantee the biological meaning of the extracted templates.
Experiments on the overlap among the samples misclassified by different features show that the features are highly complementary: in many cases, a false prediction caused by one feature is handled correctly by another. Moreover, a single type of feature easily leads the classifier to over-fit the data distribution (see Table 1 and Figure 1). Thus, integrating different features is beneficial. We therefore propose integration at two levels, the feature level and the classifier level, to combine all of the features proposed above.
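To show how templates of the form listed above might be turned into feature vectors, the sketch below matches tokenized templates against tokenized sentences, with E* matching zero or more arbitrary tokens. It assumes that entities have already been replaced by their type tags and that keywords have been reduced to their base forms. The matcher and its names are hypothetical and do not reproduce the KeyBT implementation.

```python
def template_matches(template, sentence):
    """Return True if the template matches a contiguous span of the sentence.

    Both arguments are lists of tokens. The wildcard 'E*' matches zero or
    more tokens; every other template token must match exactly.
    """
    def match_from(ti, si):
        if ti == len(template):
            return True
        if template[ti] == "E*":
            # Try every possible length for the wildcard, including zero.
            return any(match_from(ti + 1, k) for k in range(si, len(sentence) + 1))
        return (si < len(sentence)
                and sentence[si] == template[ti]
                and match_from(ti + 1, si + 1))
    return any(match_from(0, start) for start in range(len(sentence) + 1))

def template_features(templates, sentences):
    """One feature per template: the number of sentences it matches."""
    return [sum(template_matches(t, s) for s in sentences) for t in templates]

templates = [
    "<PTN> E* interact E* <PTN>".split(),
    "<PTN> E* bind E* <DNA>".split(),
]
# Hypothetical document, already entity-tagged and lemmatized.
document = [
    "<PTN> directly interact with <PTN>".split(),
    "<PTN> regulate <DNA>".split(),
]
print(template_features(templates, document))
```

The resulting counts can be used as a feature vector for a classifier instead of making a prediction directly from a single template match.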
where max_value and min_value are the maximum and minimum values that are actually seen in the input feature set.
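A minimal sketch of feature-level integration follows. The scaling equation itself is not reproduced in this text, so the code assumes the common min-max form (value − min_value) / (max_value − min_value), applied separately to each feature type before the vectors are concatenated. The function names and toy values are hypothetical.

```python
def scale_block(vectors):
    """Min-max scale one feature type using the values actually seen in it.

    ASSUMPTION: scaled = (value - min_value) / (max_value - min_value).
    """
    flat = [v for vec in vectors for v in vec]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    return [[(v - lo) / span if span else 0.0 for v in vec] for vec in vectors]

def integrate_feature_level(*blocks):
    """Scale each feature type, then concatenate the vectors document by document."""
    scaled = [scale_block(block) for block in blocks]
    return [sum(rows, []) for rows in zip(*scaled)]

# Hypothetical toy data: two documents, two feature types.
string_block = [[0.0, 4.0], [2.0, 8.0]]
entity_block = [[1.0], [3.0]]
combined = integrate_feature_level(string_block, entity_block)
print(combined)
```

Scaling each feature type to a common range prevents a type with large raw values from dominating the combined vector, although it does not change the number of dimensions each type contributes.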
However, the above method has an obvious drawback: lower dimensional features might be overwhelmed by higher dimensional ones (e.g. the named entity feature has only 5 dimensions, while the length-fixed string feature has more than 10 thousand). For this reason, we also perform the integration at the classifier level and propose two ways to implement it. The first integrates the output of each classifier: after training classifiers on the different types of features separately, we normalize and unify the output of each classifier into feature vectors and train a further classifier on them. The other is Adaboost, a general classifier integration method with two major advantages: first, Adaboost tunes the weight of each classifier according to its performance on each kind of training sample, which fully utilizes the discriminative capability of the features; second, the soft margin of Adaboost reduces the risk of over-fitting in the training process. Both approaches overcome the drawback mentioned above.
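The first step of output-based integration, turning the outputs of the individual classifiers into second-level feature vectors, could look like the sketch below. The min-max normalization, the classifier names and the scores are assumptions for illustration. Training the final classifier on these vectors is left to any standard learner.

```python
def build_meta_vectors(base_scores):
    """Build one second-level feature vector per document.

    base_scores maps a classifier name to its raw decision scores, one per
    document. Each classifier's scores are min-max normalized to [0, 1]
    (ASSUMPTION) so that every classifier contributes exactly one dimension
    on a comparable scale.
    """
    names = sorted(base_scores)
    scaled = {}
    for name in names:
        scores = base_scores[name]
        lo, hi = min(scores), max(scores)
        span = hi - lo
        scaled[name] = [(s - lo) / span if span else 0.5 for s in scores]
    n_docs = len(scaled[names[0]])
    return names, [[scaled[name][i] for name in names] for i in range(n_docs)]

# Hypothetical decision scores of three base classifiers on three documents.
base_scores = {
    "string": [-2.0, 0.0, 2.0],
    "entity": [0.0, 1.0, 0.5],
    "template": [1.0, 3.0, 5.0],
}
names, meta = build_meta_vectors(base_scores)
print(names)
print(meta)
```

Because every base classifier contributes a single dimension, the 5-dimensional entity feature and the string feature with thousands of dimensions carry equal weight in the second-level input.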
Results and discussion
The benchmark corpus is provided by BioCreAtIvE 2006. The training set contains 3536 relevant documents (title and abstract) and 1959 irrelevant. The testing set contains 750 documents, 375 of which are labelled as relevant. All of the proposed features and integration methods are implemented on the linear-kernel SVM.
In Table 2, the TF*ML schema improves recall by 6.9% without losing precision compared to the traditional TF*IDF schema. The improvement validates the effectiveness of exploiting the category relevance information of features and suggests that ML is a more effective feature value schema for general text classification applications.
Length-fixed String Feature (TF*IDF)
Unigram + Bigram
Named entities and semantic template features
Named Entity and Semantic Template Feature
Unigram + Bigram (TF*IDF)
Protein Entity occurrence
String + Entity
String + Template
String + Entity + Template
Integration on length-fixed string feature, entity feature and template feature
Unigram + Bigram
Output based Integration
Statistical significance test
Since the evaluation corpus is not large, it is necessary to perform a statistical significance test to validate the reliability of our proposed features and integration methods. Here we employ the s-test to compare the performance of systems on the pooled decisions on the individual document/category pairs.
Statistical Significance Test (s-test).
The null hypothesis is that the performance of two methods is the same; the alternative hypothesis is that the former is better than the latter.
String Vs. Unigram+Bigram
TF*ML Vs. TF*IDF
KeyBT Template Vs. Unigram+Bigram
Feature Level Integration Vs. Unigram+Bigram
Classifier Level Integration Vs. Unigram+Bigram
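A minimal sketch of a one-sided sign test of this kind is shown below. It assumes the usual formulation, in which only the decisions where the two systems differ are considered and, under the null hypothesis, each system is equally likely to be the correct one. The exact binomial tail and the toy data are assumptions for illustration.

```python
from math import comb

def s_test_p_value(correct_a, correct_b):
    """One-sided sign test on paired decisions.

    correct_a[i] / correct_b[i] are 1 if system A / B decided document i
    correctly and 0 otherwise. n is the number of decisions where the systems
    differ, k the number of those where A is correct. Under the null
    hypothesis k follows Binomial(n, 0.5); the p-value is P(X >= k).
    """
    pairs = list(zip(correct_a, correct_b))
    n = sum(1 for a, b in pairs if a != b)
    k = sum(1 for a, b in pairs if a and not b)
    if n == 0:
        return 1.0
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Hypothetical per-document correctness of two systems.
system_a = [1, 1, 1, 1, 0]
system_b = [0, 0, 0, 1, 1]
print(s_test_p_value(system_a, system_b))
```

A small p-value supports the alternative hypothesis that the first system is better than the second. For large n, a normal approximation to the binomial is often used instead of the exact sum.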
Comparison with the state of the art
Mean, Standard Deviation and Best Performance from BioCreAtIvE 2006 Vs Our Final Performance.
The best performance from BioCreAtIvE 2006 is selected from 51 runs submitted by 19 teams.
The experiment results demonstrate that the lower level features have better generalization capability but lower accuracy, while the higher level features contain rich domain dependent information, with better specificity but poorer universality. Integrating features from different levels draws on different aspects of the feature space, which reinforces the domain dependent classification and reduces the bias towards the data distribution.
Propose novel domain independent feature value schema TF*ML and length-fixed string feature;
Introduce domain dependent features (e.g. named entities, semantic templates) into the biological literature classification, and propose a novel template extraction algorithm KeyBT;
Investigate the feature-level and classifier-level integration methods to incorporate the information from different levels and perspectives.
The proposed methods are currently being integrated into our online service ONBIRES as a pre-processing module. As a next step, we will work on incremental learning to make our approaches portable to different datasets.
This work was supported by the Chinese Natural Science Foundation under grant No. 60572084 and 60621062, National High Technology Research and Development Program of China (863 Program) under No. 2006AA02Z321, as well as Tsinghua Basic Research Foundation under grant No. 052220205 and No. 053220002.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 3, 2008: Proceedings of the Second International Symposium on Languages in Biology and Medicine (LBM) 2007. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S3.
- Forman G: An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, Special Issue on Variable and Feature Selection 2002.
- Saeys Y, Inza I, Larranaga P: A Review of feature selection techniques in bioinformatics. In Bioinformatics. Volume 23. Oxford University Press; 2007:2507–2517.
- Regev Y, Finkelstein-Landau M, Feldman R: Rule-based extraction of experimental evidence in the biomedical domain: The KDD Cup 2002 (task 1). ACM SIGKDD Explorations Newsletter 4(2):90–92.
- Shi M, Edwin DS, Menon R, et al.: A machine learning approach for the curation of biomedical literature-KDD Cup 2002 (task 1). ACM SIGKDD Explorations Newsletter 4(2):93–94.
- Ghanem MM, Guo Y, Lodhi H, Zhang Y: Automatic scientific text classification using local patterns: KDD Cup 2002 (task 1). ACM SIGKDD Explorations Newsletter 4(2):95–96.
- Han B, Obradovic Z, Hu ZZ, Cathy WH, Vucetic S: Substring selection for biomedical document classification. Bioinformatics 2006, 22(17):2136–2142.
- Wilbur JW: Boosting Naive Bayesian Learning on a Large Subset of MEDLINE. In Proceedings of AMIA Symposium. Los Angeles, CA; 918–922.
- Dobrokhotov PB, et al.: Combining NLP and probabilistic categorization for document and term selection for Swiss-Prot medical annotation. Bioinformatics 2003, 19: 91–94.
- Cohen AM: Automatically Expanded Dictionaries with Exclusion Rules and Support Vector Machine Text Classifiers: Approaches to the BioCreAtIvE 2 GN and PPI_IAS Tasks. In Proceedings of the BioCreative Workshop: 22–25 April 2007, Madrid. Edited by: Krallinger M. Spanish National Cancer Research Centre; 2007:169–174.
- BioCreAtIvE [http://biocreative.sourceforge.net/]
- Salton G, McGill MJ: Introduction to Modern Information Retrieval. McGraw-Hill, Inc; 1986.
- Zhang D, Lee WS: Extracting Key-Substring-Group Features for Text Classification. In Proceedings of KDD'06, August 20–23. Philadelphia, Pennsylvania, USA; 2006:474–483.
- Yang Y, Pedersen JO: A Comparative Study on Feature Selection in Text Categorization. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA; Verity, Inc., Sunnyvale, CA, USA.
- Matheus CJ: Adding Domain Knowledge to SBL through Feature Construction. In Proceedings of the Eighth National Conference on Artificial Intelligence. Boston; 1990:803–808.
- ABNER v1.5 homepage [http://pages.cs.wisc.edu/~bsettles/abner/#performance]
- Ding SL, Huang ML, Zhu XY: Semi-supervised Pattern Learning for Extracting Relations from Bioscience Texts. Proceedings of the 5th Asia-Pacific Bioinformatics Conference 2007, 307–316.
- Lillis D, Toolan F, Collier R, Dunnion J: ProbFuse: A Probabilistic Approach to Data Fusion. In Proceedings of SIGIR'06, August. Seattle, Washington, USA; 2006.
- Ratsch G, Onoda T, Muller KR: Soft Margins for AdaBoost. Machine Learning 2001, 42: 287–320.
- Huang ML, Zhu XY, Ding SL, Yu H, Li M: ONBIRES: Ontology-based biological relation extraction system. 4th Asia-Pacific Bioinformatics Conference, Feb 13–16, 2006, 327–336.
- Yang Y, Liu X: A re-examination of text categorization methods. In Proceedings of SIGIR'99, August. Berkeley, CA, USA; 42–49.
- ONBIRES Homepage [http://spies.cs.tsinghua.edu.cn:8080]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.