Classifying protein-protein interaction articles using word and syntactic features
© Kim and Wilbur; licensee BioMed Central Ltd. 2011
Published: 3 October 2011
Identifying protein-protein interactions (PPIs) from literature is an important step in mining the function of individual proteins as well as their biological network. Since it is known that PPIs have distinctive patterns in text, machine learning approaches have been successfully applied to mine these patterns. However, the complex nature of PPI description makes the extraction process difficult.
Our approach utilizes both word and syntactic features to effectively capture PPI patterns from biomedical literature. The proposed method automatically identifies gene names with a Priority Model, then extracts grammar relations using a dependency parser. A large margin classifier with the Huber loss function learns from the extracted features, and unseen articles are classified using this data-driven model. For the BioCreative III ACT evaluation, our official runs ranked in the top positions, achieving at best 89.15% accuracy, a 61.42% F1 score, a 0.55306 MCC score, and a 67.98% AUC iP/R score.
Even though problems still remain, utilizing syntactic information for article-level filtering helps improve PPI ranking performance. The proposed system is a revision of previously developed algorithms in our group for the ACT evaluation. Our approach is valuable in showing how to use grammatical relations for PPI article filtering, in particular, with a limited training corpus. While current performance is far from satisfactory as an annotation tool, it is already useful for a PPI article search engine since users are mainly focused on highly-ranked results.
The study of protein-protein interactions (PPIs) is one of the most critical issues in life-science research for understanding the function of individual proteins and the organization of biological processes. A large body of biomedical literature describes protein-protein interaction experiments by specifying the individual interacting proteins and the corresponding interaction types. Since the vast majority of protein interaction information still exists in research articles, many efforts have been made to create protein interaction databases such as BIND, MINT, IntAct, and DIP. However, several constraints, such as the difficulty of manually curating a database, the rapid growth of the biomedical literature, and the number of newly discovered proteins, make it difficult for database curators to keep up with the published information.
The BioCreative (Critical Assessment of Information Extraction Systems in Biology) challenge is a community-wide effort to build an evaluation framework for assessing text mining systems in biological domains. The PPI tasks were specially designed to study the detection of protein-protein interactions from literature, and in BioCreative III they comprise two subtasks: ACT (Article Classification Task) and IMT (Interaction Method Task). ACT is the task of selecting abstracts that are relevant to PPIs. IMT is the task of finding experimental evidence for interacting protein pairs. ACT is particularly important since filtering PPI-relevant articles is a fundamental step in building annotation databases. Thus, high-performance ACT systems can help reduce the curation burden at the initial curation stage.
Various approaches have been proposed to extract PPI information from biomedical literature. One popular method is to use predefined phrase patterns or to exploit co-occurrence of two protein names from text. These methods, however, have inherent limitations because they only find predefined PPI patterns, and are not able to discover new patterns. Machine learning (ML) techniques can discover new patterns not captured in a known trigger word list. Hence, ML approaches have gained popularity in recent years. Support vector machines (SVMs) have been widely used, and demonstrated outstanding performance [7–9]. Naive Bayes, k-nearest neighbor, decision trees, and neural networks have been alternatively used to extract PPI information [7, 9]. Natural language processing (NLP) is a strategy utilizing linguistic features obtained from text, and also has been used for PPI extraction [10–14], where PPI sentences are assumed to have unique grammatical structures. However, the effectiveness of using parsing information has been little investigated at the article classification level.
Here, we present the method and the results from our participation in the BioCreative III ACT competition [15, 16]. Our main focus in this task was to explore the effectiveness of word and grammatical features in our supervised learning approach to PPI article classification. This includes minimizing the use of external knowledge beyond the training set, such as templates or rule-based approaches developed for other tasks, and external resources, e.g., gene/protein dictionaries or full-text information. The proposed method combines NLP strategies with ML techniques to utilize both word and syntactic features from text. To obtain gene names, articles are first tagged using a Priority Model. This step is essential because protein names are the most important words triggering PPI descriptions. The gene-tagged articles are further analyzed to obtain word and syntactic features.
Although the current approach has much room for improvement, it produced the top-ranked performance among all submitted runs in the BioCreative III ACT task. As a result, we found that, in our system pipeline, syntactic patterns along with word features can effectively help distinguish between PPI and non-PPI articles. Note that the only external resource we used for the task was gene name data for the Priority Model, so the learning was solely limited to the given training corpus, which was a series of BioCreative datasets.
The paper is organized as follows. In the next section, we describe the results of our submission to the BioCreative ACT task. This is followed by a discussion and the conclusions drawn from our experience in BioCreative III. Lastly, the methods we employed are explained.
The corpus information used in our experiments. Datasets listed: BioCreative III Training Set; BioCreative III Development Set; Total Training Set; BioCreative III Test Set.
Unlike F1 and MCC scores, AUC iP/R evaluates the performance of ranked results by considering precision rates at all recall points. For ranking systems or search engines, performance at high ranks is more important than overall ranking; hence AUC iP/R is a good indicator of ranking-based performance. In the Discussion, we instead use average precision for ranking performance because it measures ranking performance in a more conservative way. Average precision is the average of the precisions at the ranks where relevant documents appear. It corresponds to the non-interpolated AUC P/R score. It is generally lower than AUC iP/R, but it also emphasizes the higher ranks.
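The relationship between average precision and its interpolated counterpart can be illustrated with a small sketch. The function names and the toy ranking below are hypothetical, and the interpolation rule (each precision is replaced by the best precision at the same or any higher recall) is one common convention; the official BioCreative evaluation script may differ in detail.

```python
def precisions_at_relevant_ranks(ranked_labels):
    """Precision measured at each rank where a relevant (label 1) item appears."""
    hits = 0
    precisions = []
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return precisions


def average_precision(ranked_labels):
    """Non-interpolated score: mean precision over the relevant ranks."""
    precisions = precisions_at_relevant_ranks(ranked_labels)
    return sum(precisions) / len(precisions) if precisions else 0.0


def interpolated_average_precision(ranked_labels):
    """Interpolated score: each precision is replaced by the maximum
    precision observed at that recall level or any higher one."""
    precisions = precisions_at_relevant_ranks(ranked_labels)
    best = 0.0
    interpolated = []
    for p in reversed(precisions):  # sweep from highest recall to lowest
        best = max(best, p)
        interpolated.append(best)
    return sum(interpolated) / len(interpolated) if interpolated else 0.0


# Toy ranking: 1 = PPI-relevant article, 0 = irrelevant article.
ranking = [1, 0, 1, 1, 0]
ap = average_precision(ranking)
iap = interpolated_average_precision(ranking)
```

On this toy ranking the interpolated value is never below the plain average precision, which matches the statement that average precision is the more conservative measure.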
The feature combinations used for submitted runs on the article classification task
BC3 Dev Set
Official scores for the ACT competition.
Performance results for corrected PPI classification on the ACT test set.
Run 5 utilized binary feature combinations to capture higher-order relationships between features. The performance of Run 5 changed very little compared to Run 1 and Run 3, so this attempt was unsuccessful and did not meet our expectations. We did not have time to analyze and optimize Run 5 before submission. According to our post-workshop experiments, classification performance is very sensitive to higher-order feature combinations and is difficult to optimize. For Run 5, we simply found a weight threshold that retained as many features as possible and yet increased performance on the BioCreative development set. That resulted in a total of 286,547 features. In the Discussion, we further investigate the effect of higher-order features.
Given the time available for the task, the submitted runs are not fully optimized. We believe further improvement is possible based on the ACT development set and the recently released gold standard test set. However, we did not have sufficient time to investigate all the options for optimizing the current system with both datasets. Over-tuning classification performance on the development set leads to overfitting and decreased classification performance on the test set. Therefore, our tuning for the submitted runs centered on different data and feature combinations rather than on fine-tuning parameters and heuristic knowledge. The performance produced by our system shows that the strategy of using both word and syntactic features in our classification framework is a good combination for the PPI article classification task.
Article filtering with imbalanced classes
One main issue in the BioCreative III ACT competition is the imbalance between the numbers of positive and negative articles. Negative examples make up 82.95% of the ACT development set. In the BioCreative test set, the ratio goes up to 84.83%. However, the training corpus gathered from previous BioCreative competitions is a rather balanced dataset. To overcome this problem, we tried several approaches. A popular method for handling imbalanced training data is to balance the number of training examples by over- or under-sampling [25, 26]. The same sampling technique can be applied when the imbalance lies in the test data. For example, the training corpus can be reorganized by over-sampling non-PPI articles or under-sampling PPI articles. Another approach for addressing the imbalance issue is the careful selection of negative examples from unlabeled data as an additional training source. This method is similar to active learning. Also, cost-sensitive learning can be used along with an ensemble of multiple classifiers. Nevertheless, those attempts were not successful for the BioCreative ACT task.
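As an illustration of the sampling idea only (as noted above, it did not help on the ACT task), the following sketch under-samples PPI articles so that non-PPI articles make up a target share of the training set. The function name, the integer stand-ins for articles, and the 80% target are hypothetical choices for the example.

```python
import random


def undersample_positives(positives, negatives, negative_ratio, seed=0):
    """Randomly keep only as many positives as needed so that negatives
    account for `negative_ratio` of the resulting training set."""
    wanted = round(len(negatives) * (1.0 - negative_ratio) / negative_ratio)
    wanted = min(wanted, len(positives))  # cannot keep more than we have
    rng = random.Random(seed)             # fixed seed for reproducibility
    return rng.sample(positives, wanted), list(negatives)


# Toy balanced corpus: 100 PPI and 100 non-PPI articles (integers as stand-ins).
ppi = list(range(100))
non_ppi = list(range(100, 200))
kept_ppi, kept_non_ppi = undersample_positives(ppi, non_ppi, negative_ratio=0.8)
```

Over-sampling non-PPI articles would follow the same pattern, drawing negatives with replacement instead of discarding positives.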
The performance drop with an imbalanced test set compared to a balanced one can be easily explained. Assuming there is a prediction system performing at 90% precision for balanced data, 10% of positive predictions are false positive cases. If negative examples of the same kind are increased by a factor of six, false positive predictions are six times higher than in the former case. That results in a precision drop to 60% from 90%. This imbalance problem affects most of the performance scores except for accuracy. Accuracy can remain high because of dominant negative examples as explained in the Results section. In our system, the classification performance on training data exceeds 96% F1 score and 99% average precision. But this cannot ensure high performance on unbalanced test data.
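The arithmetic behind this example can be written out as a short sketch. It assumes, as the paragraph does, that the number of true positives stays fixed and that false positives grow in proportion to the number of negative examples; the function name is hypothetical.

```python
def precision_under_imbalance(balanced_precision, negative_factor):
    """Precision after the number of negatives is multiplied by negative_factor.

    Assumes true positives are unchanged and false positives scale linearly
    with the number of negative examples.
    """
    true_positives = balanced_precision
    false_positives = (1.0 - balanced_precision) * negative_factor
    return true_positives / (true_positives + false_positives)


# 90% precision on balanced data, negatives increased sixfold: 90 / (90 + 60).
print(round(precision_under_imbalance(0.90, 6), 2))  # 0.6
```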
Utilizing word and syntactic feature types
Average precision rates when adding grammar relations to single words. Feature sets compared: Single Words (SW); Grammar Relations (GR); SW + GR.
As shown in the table, adding word-word relationships to single-word features boosts performance by 3.7% for naïve Bayes classifiers. For SVM and Huber classifiers the improvement is smaller, but it still shows that word dependency has a positive effect on PPI article classification. The Huber classifier was chosen for both data scalability and classification performance. Based on the performance comparison in Table 5, our Huber approach produces the best average precision overall.
Performance changes on the ACT development set by varying feature types.
The system reaches top performance on the BioCreative III development set when baseline and higher-order features are both used, which is the setting in Run 5. However, higher-order features are not easy to tune. More importantly, higher-order features do not provide the best result for the BioCreative III test set. In the proposed approach, gene name detection is a critical component of the system since gene names are handled individually and gene anonymization is based on this gene detection. During the BioCreative III period, we found some flaws in the Priority Model's detection of gene names. Therefore, current performance is also limited by this detection capability.
Ranking system for PPI article classification
In a binary classification system, F1 and MCC scores are useful for evaluating system performance. In a ranking system, however, top-ranking performance is more important than overall ranking. AUC iP/R and average precision are sensitive indicators for ranked results, and our system was tuned to achieve better average precision (AUC P/R) for the submitted results. The best AUC iP/R score we obtained in the official results is 0.6798, whereas the average AUC score of all participants is 0.4975 and the median AUC score is 0.5367. The precision-recall curves of our system and others also show significant differences in top-ranking results (http://www.biocreative.org/resources/biocreative-iii/workshop-talks). Figure 2 depicts the precision-recall curve for Run 4. The precision stays above 90% until recall reaches 22%. Another perspective on ranking performance is the precision at rank n (P@n). For Run 5, P@100, P@200, and P@300 are 94%, 92%, and 85%, respectively. This shows that the proposed approach is effective for a ranking-based search system even though the overall performance is far from fully automating PPI article selection for annotation.
In this paper, we present our system and its performance for the BioCreative III ACT competition. Our focus for the task was to develop a machine learning framework to effectively capture PPI articles from biomedical literature with minimal external resource use. The main idea here is detecting gene names and utilizing word-to-word relationships for automatically learning unique PPI patterns. The proposed approach identifies gene names by a Priority Model, and dependency relations are extracted by analyzing grammatical structures in sentences. A large margin classifier using the Huber loss function is used to learn from extracted word and syntactic features. Data scalability was also considered in selecting Huber classifiers for expanding target data to the whole PubMed corpus in the future.
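Precision at rank n is simple to compute from a ranked list of relevance labels. The sketch below uses a hypothetical function name and a toy ranking, not the official results.

```python
def precision_at_n(ranked_labels, n):
    """Fraction of the top-n ranked articles that are relevant (label 1)."""
    if n <= 0:
        return 0.0
    return sum(ranked_labels[:n]) / n


# Toy ranking: 1 = PPI-relevant article, 0 = irrelevant article.
toy_ranking = [1, 1, 0, 1, 0, 0, 1, 0]
p_at_4 = precision_at_n(toy_ranking, 4)  # 3 relevant articles in the top 4
```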
Different feature types, including multi-words and grammar relations with stemming, and feature selection were exploited for submitted runs. Different training corpora were also used. Higher-order features were studied to see the possibility of automatic feature expansion. Through these studies, we found that syntactic features are useful at the article classification level as well as at the sentence classification level. Even though there is a limit to detection of correct gene names and the system is not optimized enough for the imbalanced nature of the dataset, the proposed system performs well in both binary classification performance and PPI ranking performance in all different data and feature combinations.
Current classification performance was achieved using only a data-driven model that combines different types of machine learning techniques. However, in the current setup, identifying gene names and analyzing dependency relationships are critical components, which need careful setup that draws on PPI-related heuristic knowledge. Determining how much higher-order features can help the PPI classification task also remains an open issue. As fully automatic annotation tools, state-of-the-art systems are still far from real-world use. However, they can be utilized as support systems for manual curation. In particular, based on the BioCreative III ACT performance, our system is already useful for PPI article search in a Web environment.
Gene name detection using a Priority Model
In the proposed approach, gene names are identified using a Priority Model, which is a statistical language model for named entity recognition. For named entities, a word to the right is in general more likely to determine the nature of the entity than a word to the left. The Priority Model was constructed to follow this rule.
To obtain p_α and q_α, a limited memory BFGS method and a variable order Markov model are used. For gene name detection, it is hard to get noise-free positive and negative names; however, we used previously built data, SemCat and Entrez Gene data, as an additional source to learn gene names.
Some common words, e.g., mutant and protein, are misclassified as gene names when this model is used. However, adding manual corrections might introduce unexpected bias, and that was not our intention for the ACT system. Thus, we only added a simple rule that a string consisting entirely of digits is not a gene name, which is one of the cases misclassified by the learned model. Furthermore, only noun phrases were tested to minimize computation time and detection errors.
Choosing word features
Multi-words: multi-word features are commonly known as n-grams. Since protein names sometimes contain more than a single word and since a PPI is an interaction between proteins, n consecutive words can be a good hint for separating PPI and non-PPI articles. Hence, we use unigrams, bigrams, and trigrams. Only neighboring words are considered because long-distance word relationships are already captured by syntactic features. Longer word sequences would also enlarge the feature space exponentially without improving performance.
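A minimal sketch of multi-word feature extraction is given below. Tokenization, the function name, and the example sentence are assumptions made for illustration; the sketch only shows how unigrams, bigrams, and trigrams of neighboring words are enumerated.

```python
def word_ngrams(tokens, max_n=3):
    """Return all n-grams of consecutive tokens for n = 1 .. max_n."""
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features


# Hypothetical, already tokenized sentence fragment.
tokens = ["BRCA1", "interacts", "with", "RAD51"]
features = word_ngrams(tokens)
```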
Sub-strings: while the basic elements of multi-word features are words, those of string features are characters, i.e., alphabetic and numeric characters. In biomedical literature, many entities appear in variant forms. Also, there is a report that the difference between distributions on training and test sets in PPI tasks can be reduced by considering character-based features. Therefore, character lengths from four to seven were tested on the ACT development set, and six consecutive characters produced slightly better results than the other lengths. For our submissions, six characters were used.
MeSH terms: the available training corpus is a set of PubMed articles, which have several fields for each record. The categories include journal title, article title, author list, abstract, MeSH, and article ID. Article title and abstract are the text we mainly used for word and syntactic feature extraction. MeSH terms are the additional source utilized for the PPI task. MeSH is a controlled vocabulary for indexing and searching biomedical literature. MeSH terms are organized in a hierarchical structure and are used to indicate the topics of an article. Thus, this controlled vocabulary set can be helpful to find PPI-relevant articles.
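The sub-string features can be sketched as a sliding window over characters. How the original system normalized text is not described here, so the choices below (lowercasing, dropping non-alphanumeric characters, and windowing within each whitespace-separated token) are assumptions for illustration, as is the function name.

```python
def char_substrings(text, length=6):
    """Return all runs of `length` consecutive alphanumeric characters,
    taken within each whitespace-separated token of `text`."""
    features = []
    for token in text.lower().split():
        chars = "".join(ch for ch in token if ch.isalnum())
        for start in range(len(chars) - length + 1):
            features.append(chars[start:start + length])
    return features


# Variant word forms share sub-strings such as "intera" and "nterac".
subs_a = char_substrings("interacts")
subs_b = char_substrings("interaction")
shared = set(subs_a) & set(subs_b)
```

Tokens shorter than the window length produce no features in this sketch; a real system might keep them whole instead.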
Choosing syntactic features
Dependency parsing: the C&C CCG parser was used to obtain dependency relations. The software was publicly available and easy to integrate into our library. Since we detect gene names beforehand, each gene or protein name can be handled individually. The output of the parser for a sentence is a set of dependency relations, each of which contains a grammar relation name, a head word, and a dependent word. Thus, the head word is coupled with the dependent word by the specific relationship. However, the extracted patterns are very sparse considering the size of the training corpus; hence we use an anonymization technique for gene names.
Gene name anonymization: the purpose of PPI article classification is to identify whether an article contains PPI information, not a gene or protein name itself. Therefore, in a dependency relationship, particular protein names are not so important. Gene name anonymization is a simple strategy that replaces a detected gene/protein word with a special tag, e.g., ‘PTNWORD’. This technique decreases the complexity of relationship features, while the relationship information remains the same. Figure 4 shows an example sentence and its syntactic feature sets used in our approach.
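The following sketch shows how dependency relations might be turned into anonymized features. The relation triples are hand-written stand-ins rather than actual parser output, the gene set stands in for the Priority Model's detections, and the feature string format is a hypothetical choice; only the ‘PTNWORD’ tag comes from the text above.

```python
PTN_TAG = "PTNWORD"


def anonymize(word, gene_names):
    """Replace a detected gene/protein word with the special tag."""
    return PTN_TAG if word in gene_names else word


def relation_features(relations, gene_names):
    """Build one feature string per (relation name, head, dependent) triple."""
    features = []
    for name, head, dependent in relations:
        head = anonymize(head, gene_names)
        dependent = anonymize(dependent, gene_names)
        features.append(f"{name}({head},{dependent})")
    return features


# Hypothetical relations for "BRCA1 binds RAD51" and "TP53 binds MDM2".
genes = {"BRCA1", "RAD51", "TP53", "MDM2"}
first = relation_features([("ncsubj", "binds", "BRCA1"), ("dobj", "binds", "RAD51")], genes)
second = relation_features([("ncsubj", "binds", "TP53"), ("dobj", "binds", "MDM2")], genes)
```

After anonymization the two sentences yield identical features, which is how the technique reduces sparsity while preserving the relationship information.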
Choosing higher-order features from feature combinations
After testing on the training data using a trained classifier, all bigrams are generated by pairing any two features from misclassified articles.
For each bigram, the sum of the partial derivatives of the loss function over the respective data points is evaluated.
Bigrams occurring at least a times and with a summed partial derivative of at least b in absolute value are selected.
Here, the loss function h is the modified Huber loss function used by our classifier approach. We set a and b to 4 and 340, respectively, for the official Run 5. These parameters were empirically chosen to produce the best classification performance on the BioCreative III development set.
The Huber classifier [22, 31] used in the BioCreative task is a variant of support vector machines. This method determines feature weights that minimize the modified Huber loss function, which is a function that replaces the hinge loss function commonly used in SVM learning.
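The selection steps above can be sketched as follows. Several details are assumptions made for illustration: labels are +1/-1, an article counts as misclassified when the label times the classifier score is not positive, the derivative is taken with respect to the weight of a new binary pair feature, and the sums run over misclassified articles only. The loss derivative uses the commonly cited form of the modified Huber loss, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def modified_huber_derivative(z):
    """Derivative of the modified Huber loss with respect to the margin z."""
    if z <= -1.0:
        return -4.0
    if z < 1.0:
        return -2.0 * (1.0 - z)
    return 0.0


def select_feature_pairs(articles, a, b):
    """articles: iterable of (feature set, label in {+1, -1}, classifier score).
    Returns pairs seen at least `a` times whose summed partial derivative
    is at least `b` in absolute value."""
    counts = defaultdict(int)
    gradients = defaultdict(float)
    for features, label, score in articles:
        margin = label * score
        if margin > 0:
            continue  # correctly classified: not used for pair generation
        # Chain rule for a binary pair feature that is present (value 1).
        gradient = modified_huber_derivative(margin) * label
        for pair in combinations(sorted(features), 2):
            counts[pair] += 1
            gradients[pair] += gradient
    return sorted(p for p in counts if counts[p] >= a and abs(gradients[p]) >= b)


toy_articles = [
    ({"f1", "f2", "f3"}, +1, -2.0),  # misclassified positive
    ({"f1", "f2"}, +1, -0.5),        # misclassified positive
    ({"f1", "f3"}, -1, 3.0),         # misclassified negative
    ({"f2", "f3"}, +1, 1.5),         # correctly classified, ignored
]
selected = select_feature_pairs(toy_articles, a=2, b=5.0)
```

In the toy data the pair (f1, f3) occurs twice but its derivatives cancel because it appears in one misclassified positive and one misclassified negative article, so only (f1, f2) is selected.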
where 〈|x|〉 is the average Euclidean norm of the feature vectors in the training set. For the ACT task, the parameter λ' was roughly tuned to maximize average precision rates for the BioCreative development set. Based on these experiments, it was finally set to 0.0005 for submitted runs.
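For readers unfamiliar with the loss, the sketch below gives the commonly cited form of the modified Huber loss together with the average Euclidean norm mentioned above. The objective function is an assumed standard form (mean loss plus an L2 penalty with a generic strength `lam`); how λ' and the average norm combine into the actual regularization term is not reproduced here, and all names are hypothetical.

```python
import math


def modified_huber_loss(z):
    """Modified Huber loss of the margin z = y * (w . x):
    linear for z <= -1, quadratic for -1 < z < 1, zero for z >= 1."""
    if z <= -1.0:
        return -4.0 * z
    if z < 1.0:
        return (1.0 - z) ** 2
    return 0.0


def average_norm(vectors):
    """Average Euclidean norm of the feature vectors."""
    return sum(math.sqrt(sum(v * v for v in vec)) for vec in vectors) / len(vectors)


def objective(weights, vectors, labels, lam):
    """Assumed form: mean modified Huber loss plus (lam / 2) * |w|^2."""
    total = 0.0
    for vec, label in zip(vectors, labels):
        score = sum(w * v for w, v in zip(weights, vec))
        total += modified_huber_loss(label * score)
    penalty = 0.5 * lam * sum(w * w for w in weights)
    return total / len(vectors) + penalty


toy_vectors = [[3.0, 4.0], [0.0, 0.0]]
toy_labels = [+1, -1]
value = objective([0.0, 0.0], toy_vectors, toy_labels, lam=0.0005)
```

Unlike the hinge loss, this loss is continuously differentiable, and it grows only linearly for badly misclassified points.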
The authors would like to thank Won Kim and Larry Smith for valuable comments on implementing the proposed method. The authors are supported by the Intramural Research Program of the NIH, National Library of Medicine.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 8, 2011: The Third BioCreative – Critical Assessment of Information Extraction in Biology Challenge. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S8.
1. Bader GD, Donaldson I, Wolting C, Ouellette BFF, Pawson T, Hogue CWV: BIND-the Biomolecular Interaction Network Database. Nucleic Acids Research 2003, 31: 248–250. 10.1093/nar/gkg056
2. Ceol A, Aryamontri AC, Licata L, Peluso D, Briganti L, Perfetto L, Castagnoli L, Cesareni G: MINT, the molecular interaction database: 2009 update. Nucleic Acids Research 2010, 38: D532-D539. 10.1093/nar/gkp983
3. Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C, Feuermann M, Ghanbarian AT, Kerrien S, Khadake J, Kerssemakers J, Leroy C, Menden M, Michaut M, Montecchi-Palazzi L, Neuhauser SN, Orchard S, Perreau V, Roechert B, van Eijk K, Hermjakob H: The IntAct molecular interaction database in 2010. Nucleic Acids Research 2010, 38: D525-D531. 10.1093/nar/gkp878
4. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Research 2004, 32: D449-D451. 10.1093/nar/gkh086
5. Blaschke C, Leon EA, Krallinger M, Valencia A: Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics 2005, 6(Suppl 1):S16. 10.1186/1471-2105-6-S1-S16
6. Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, Wilbur J, Hirschman L, Valencia A: Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biology 2008, 9(Suppl 2):S1. 10.1186/gb-2008-9-s2-s1
7. Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, Tuekam B, Zhang S, Baskin B, Bader GD, Michalickova K, Pawson T, Hogue CWV: PreBIND and Textomy - mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 2003, 4: 11.
8. Mitsumori T, Murata M, Fukuda Y, Doi K, Doi H: Extracting protein-protein interaction information from biomedical text with SVM. IEICE Transactions on Information and Systems 2006, E89-D(8):2464–2466. 10.1093/ietisy/e89-d.8.2464
9. Sugiyama K, Hatano K, Yoshikawa M, Uemura S: Extracting information on protein-protein interactions from biological literature based on machine learning approaches. Genome Informatics 2003, 14: 699–700.
10. Björne J, Ginter F, Pyysalo S, Tsujii J, Salakoski T: Complex event extraction at PubMed scale. Bioinformatics 2010, 26: i382-i390. 10.1093/bioinformatics/btq180
11. Buyko E, Hahn U: Evaluating the impact of alternative dependency graph encodings on solving event extraction tasks. Proceedings of the Conference on Empirical Methods in Natural Language Processing: 9–11 October 2010; Cambridge 2010, 982–992.
12. Jang H, Lim J, Lim JH, Park SJ, Lee KC, Park SH: Finding the evidence for protein-protein interactions from PubMed abstracts. Bioinformatics 2006, 22: e220-e226. 10.1093/bioinformatics/btl203
13. Kim S, Shin SY, Lee IH, Kim SJ, Sriram R, Zhang BT: PIE: an online prediction system for protein-protein interactions from text. Nucleic Acids Research 2008, 36: W411-W415. 10.1093/nar/gkn281
14. Miyao Y, Sagae K, Sætre R, Matsuzaki T, Tsujii J: Evaluating contributions of natural language parsers to protein-protein interaction extraction. Bioinformatics 2009, 25: 394–400. 10.1093/bioinformatics/btn631
15. Krallinger M, Vazquez M, Leitner F, Valencia A: Results of the BioCreative III (interaction) article classification task. Proceedings of the BioCreative III: 13–15 September 2010; Bethesda 2010, 17–23.
16. Kim S, Wilbur WJ: Improving protein-protein interaction article classification performance by utilizing grammatical relations. Proceedings of the BioCreative III: 13–15 September 2010; Bethesda 2010, 83–88.
17. Tanabe L, Wilbur WJ: A priority model for named entities. Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology: 4–9 June 2006; New York 2006, 33–40.
18. Huang M, Ding S, Wang H, Zhu X: Mining physical protein-protein interactions from the literature. Genome Biology 2008, 9(Suppl 2):S12. 10.1186/gb-2008-9-s2-s12
19. Lowe HJ, Barnett GO: Understanding and using the medical subject headings (MeSH) vocabulary to perform literature searches. The Journal of the American Medical Association 1994, 271(14):1103–1108. 10.1001/jama.271.14.1103
20. Curran JR, Clark S, Bos J: Linguistically motivated large-scale NLP with C&C and Boxer. Proceedings of the ACL 2007 Demonstrations Session (ACL-07 demo): 23–30 June 2007; Prague 2007, 33–36.
21. Ando RK: BioCreative II gene mention tagging system at IBM Watson. Proceedings of the Second BioCreative Challenge Evaluation Workshop: 23–25 April 2007; Madrid 2007, 101–103.
22. Zhang T: Solving large scale linear prediction problems using stochastic gradient descent algorithms. Proceedings of the 21st International Conference on Machine Learning: 4–8 July 2004; Banff 2004, 919–926.
23. Baldi P, Brunak S, Chauvin Y, Andersen CAF, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 2000, 16(5):412–424. 10.1093/bioinformatics/16.5.412
24. Porter MF: An algorithm for suffix stripping. Program 1980, 14(3):130–137.
25. Kubat M, Matwin S: Addressing the curse of imbalanced training sets: one-sided selection. Proceedings of the 14th International Conference on Machine Learning: 8–12 July 1997; Nashville 1997, 179–186.
26. Batista GEAPA, Prati RC, Monard MC: A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations 2004, 6: 20–29. 10.1145/1007730.1007735
27. Settles B: Active learning literature survey. In Tech. Rep. 1648. University of Wisconsin-Madison; 2010.
28. Nash S, Nocedal J: A numerical study of the limited memory BFGS method and the truncated-Newton method for large scale optimization. SIAM Journal on Optimization 1991, 1(3):358–372. 10.1137/0801023
29. Niu Y, Otasek D, Jurisica I: Evaluation of linguistic features useful in extraction of interactions from PubMed; application to annotating known, high-throughput and predicted interactions in I2D. Bioinformatics 2010, 26: 111–119. 10.1093/bioinformatics/btp602
30. Rebholz-Schuhmann D, Jimeno-Yepes A, Arregui M, Kirsch H: Measuring prediction capacity of individual verbs for the identification of protein interactions. Journal of Biomedical Informatics 2010, 43(2):200–207. 10.1016/j.jbi.2009.09.007
31. Smith LH, Wilbur WJ: Finding related sentence pairs in MEDLINE. Information Retrieval 2010, 13(6):601–617. 10.1007/s10791-010-9126-8
32. Vapnik VN: Statistical Learning Theory. Springer; 1998.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.