Simple and efficient machine learning frameworks for identifying protein-protein interaction relevant articles and experimental methods used to study the interactions

Background: Protein-protein interaction (PPI) is an important biomedical phenomenon. Automatically detecting PPI-relevant articles and identifying the methods used to study PPI are important text mining tasks. In this study, we have explored domain-independent features to develop two open source machine learning frameworks. One, named “Simple Classifier”, performs binary classification to determine whether a given article is PPI relevant or not; the other, named “OntoNorm”, maps PPI-relevant articles to the corresponding interaction method nodes in the standardized PSI-MI (Proteomics Standards Initiative-Molecular Interactions) ontology.
Results: We evaluated our systems in the context of the BioCreative challenge competition using the standardized data set. Our systems are amongst the top systems reported by the organizers, attaining a 60.8% F1-score for identifying relevant documents and a 52.3% F1-score for mapping articles to the interaction method ontology.
Conclusion: Our results show that domain-independent machine learning frameworks can perform competitively at the tasks of detecting PPI-relevant articles and identifying the methods that were used to study the interactions in such articles.
Availability: Simple Classifier is available at http://sourceforge.net/p/simpleclassify/home/ and OntoNorm at http://sourceforge.net/p/ontonorm/home/.


Introduction
Protein-protein interactions (PPIs) are responsible for many biological phenomena. Understanding these interactions can greatly benefit biological research; for example, it can help us understand the causes of certain diseases, which can in turn lead to the development of therapeutic interventions. The significance of protein-protein interactions can be seen in the case of the BRCA1 and BARD1 proteins, which have been reported to interact with each other; a mutation in BRCA1 can disrupt this interaction, which can lead to breast cancer [1].
The importance of PPIs has led to the development of several curated databases including IntAct [2], BioGRID [3] and MINT [4]. These databases are generally curated manually and store information including the proteins that interact with each other, the articles in which these interactions were reported and the methods that were used to discover them. However, manually curating articles for PPIs is a time-consuming process, and with the fast pace of research and the rapid increase in the amount of published literature, the effort required to maintain such databases has grown significantly. This has spurred the development of text mining approaches to automate the identification of such interactions and aid the manual curation process.
One important task is to identify the methods used to study PPIs, known as the interaction method task (IMT). IMT helps database curators determine the validity of reported interactions, since certain methods give better evidence of an interaction than others [5,6]. The methods sub-ontology of the PSI-MI (Proteomics Standards Initiative-Molecular Interactions) ontology is a controlled vocabulary to which interaction methods can be mapped [7]. Annotating methods with PSI-MI's methods sub-ontology will thus help database curation efforts.
To efficiently identify PPI interaction methods, another important task is to first determine whether a given article contains protein-protein interaction data at all, known as the article classification task (ACT). ACT is also indispensable for other PPI-related text mining applications, such as interaction event detection.
Different approaches have been developed for the ACT task. A simple approach is to use n-gram features to train supervised machine learning algorithms, which have been deployed in many similar tasks [8][9][10][11][12][13][14][15][16][17][18][19]. Normalization and feature selection may be conducted before training the classifiers. Domain-specific adaptations of this approach have been used for this task as well. One modification was proposed to make use of contextual bag-of-words features [20]. The context information included the number of protein names appearing in the abstract of the article to be classified, with the assumption that the presence of more protein names in the abstract indicates a greater likelihood that the article contains protein-protein interaction data. Support vector machine (SVM) classifiers were trained on these contextual bag-of-words features. Other extensions added MeSH terms as features along with selected n-gram features [21][22][23][24]. Grover et al. used a "bag-of-nlp" approach in which the output of a natural language processing pipeline was augmented with word features to classify articles [25]. Dogan et al. identified the 10 nearest neighbors of the test article in the training data and used the gold standard annotations of these neighbors as features along with n-gram features [26].
Approaches that explore features beyond words have also been proposed. A semi-supervised approach was suggested by [27], in which dependency tree based patterns are automatically learned from the training data; a set of eight patterns was manually seeded for this approach. Another approach made use of information retrieval techniques to identify protein-protein interaction relevant documents [28], using a set of well-known protein interaction related keywords as queries. Kolchinsky et al. made use of features from the citation network of the relevant literature to classify articles [15]. Kim and Wilbur automatically extracted grammatical patterns from the training corpus and used these patterns for ACT [29]; they found that this approach performs better than machine learning approaches based on a bag-of-words representation.
Although much research has been done on the ACT, research on identifying interaction methods is limited. Similar to our goal, most studies in this area attempt to associate method nodes in the PSI-MI ontology with articles. The OntoGene system developed by Rinaldi et al. [23,30] makes use of pattern matching techniques to identify interaction methods, relying on handcrafted patterns to improve performance. A pattern matching approach has been employed by Lourenco et al. [31] as well. Dogan et al. combined pattern matching and k-nearest neighbors' annotations for this task [26]; they also mapped each article's MeSH terms to PSI-MI nodes to identify relevant method nodes. Machine learning-based approaches that view IMT as a document-level classification problem have also been reported [24,32]; these expanded the synonyms for PSI-MI nodes with synonyms from the UMLS Metathesaurus. Matos et al. approached IMT as an information retrieval problem [19]: the documents were indexed using Lucene and retrieved using method names.
In this study, we report on the development of machine learning frameworks to identify articles that contain protein-protein interaction data and to then process these articles to identify the methods that were used to discover the interactions. Unlike previous approaches, many of which rely on human-curated data or domain-specific features, our goal was to develop an adaptable framework by exploring domain-independent features, one that can be generalized to other text mining applications with no or minimal adaptation. For example, our ACT framework can be applied to train on and classify any type of text document, regardless of domain. Similarly, our IMT framework can be used to map terms from any ontology to any text. As a result, we explored machine learning-based approaches using features that are domain independent.
The BioCreative (Critical Assessment of Information Extraction systems in Biology) challenge is a community effort to promote the development of biomedical text mining applications. To date, four BioCreative challenges have been organized. Interaction methods featured in two of these challenges, while the article classification task featured in the last three [33][34][35][36][37]. The latest challenge, BioCreative III, included both ACT and IMT. We used the data and evaluation provided through BioCreative III to develop and evaluate our machine learning frameworks.

Methods
We explored supervised machine learning approaches for both ACT and IMT. The data we used for training is described below, followed by the ACT classification and IMT classification tasks.

Training, development and test data
The organizers of BioCreative III provided training, development and test data for both tasks [38]. The sizes of the data provided for ACT and IMT are given in Table 1 and Table 2, respectively. Note that the data provided for ACT and IMT are independent of each other. For ACT, the distribution of positive and negative instances in the development data reflected the true distribution, i.e. approximately 16% of the articles contained protein-protein interaction data. For the ACT training data, an equal number of positive and negative instances was provided. The distribution of instances in the test data was similar to that in the development data. Each article's title and abstract were used for training and testing. For IMT, the task was to identify interaction methods at the document level, not at the interaction or mention level. The methods sub-ontology of the PSI-MI ontology was used to obtain the collection of possible methods; from this sub-ontology, 115 nodes were allowed for IMT. Four of the 115 allowed nodes accounted for roughly half of all annotations; these were (in order of highest to lowest frequency): "anti-bait coimmunoprecipitation", "anti-tag coimmunoprecipitation", "pull down" and "two hybrid". "Anti-bait coimmunoprecipitation" and "anti-tag coimmunoprecipitation" alone accounted for one third of all annotations. Within the test data, 222 of the 305 articles were annotation-relevant; the remaining 83 articles had no annotations assigned to them. The full text of each article was used for training and testing. Although the full-text articles were originally in PDF format, the organizers of BioCreative III also provided the corresponding files in text format, which we used in our experiments.

ACT Classification
As noted earlier, the distribution of the development data is similar to that of the test data. Hence, for tuning, we trained models on the development data and tested them on the training data. We trained two different classifier models: Support Vector Machines (SVM) with a polynomial kernel [39] and Naïve Bayes Multinomial (NBM). We used the implementations provided in the Weka data mining library [40] (downloaded from http://www.cs.waikato.ac.nz/ml/weka/). We normalized all text by lowercasing all characters, removing punctuation, stemming all words (using the Porter stemming algorithm [41,42]) and removing numbers. We then extracted unigrams (individual words) and bigrams (two consecutive words) as features. As this led to a large number of features, we conducted feature selection with two feature scoring algorithms: mutual information and the chi-square score. All features were scored with these algorithms and we used the top 20, 50, 100, 400 or 1000 features to train the classifier.
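A minimal sketch of this kind of pipeline is shown below using scikit-learn (the study used Weka, so the library, the tiny corpus, the labels and k=10 are all illustrative; Porter stemming is omitted for brevity, and the paper also tried mutual information scoring and larger feature counts):

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def normalize(text):
    """Lowercase, strip punctuation and numbers (Porter stemming omitted)."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
    text = re.sub(r"\d+", " ", text)       # remove numbers
    return text

pipeline = Pipeline([
    # unigrams and bigrams over the normalized title + abstract
    ("ngrams", CountVectorizer(preprocessor=normalize, ngram_range=(1, 2))),
    # keep the top-k features by chi-square score (mutual information is
    # the other scorer described in the text; k is illustrative here)
    ("select", SelectKBest(chi2, k=10)),
    ("clf", MultinomialNB()),
])

# Hypothetical toy corpus: 1 = PPI relevant, 0 = not relevant
docs = ["Protein A binds protein B in a two hybrid assay.",
        "We sequenced the genome of strain 42.",
        "BRCA1 interacts with BARD1.",
        "Weather patterns over the Pacific were recorded."]
labels = [1, 0, 1, 0]
pipeline.fit(docs, labels)
print(pipeline.predict(["The kinase interacts with its substrate."]))
```

The same pipeline object can be refit on any labeled text collection, which is the sense in which such a classifier is domain independent.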
We explored various combinations of the classifier algorithms, feature selection methods and feature numbers mentioned above. We tried using unigrams only as well as both unigrams and bigrams. All features were uniformly weighted when provided to the machine learning classifiers. For the BioCreative III challenge, each participating team was allowed to submit 10 runs for ACT; five runs could be submitted offline and the other five online, using XML-RPC. Our runs 1-5 were submitted offline and runs 6-10 online; they are listed in Table 3. For all runs, we used the mutual information feature selection algorithm, as it gave better performance than the chi-square score during tuning. For six of the ten runs, we combined the training and development data to train the classifier, as we expected larger training data to perform better. At the same time, the distribution of the development data was similar to that of the test data; hence, for the remaining four runs, we trained the classifier on the development data only.
For ACT, we developed a framework that can apply the feature selection methods described above with different classifier algorithms. The framework, called Simple Classifier, is available online at http://sourceforge.net/p/simpleclassify/home/. It can be used to train classifiers for any text collection.

IMT Classification
The IMT involved mapping nodes in the PSI-MI ontology to articles. For each ontology node, we obtained the concept name and its synonyms. We manually added synonyms for some ontology nodes, such as "anti bait immunoprecipitation" for "anti bait coimmunoprecipitation" and "radioligand binding" for "saturation binding". A keyword for each ontology node was manually extracted by the first author, for example, "coimmunoprecipitation" for "anti bait coimmunoprecipitation". The keywords were selected based on the author's judgment of the most informative word in the concept. We extracted unigrams and bigrams from each node's concept name and synonyms. For each unigram and bigram, we calculated the mutual information score and chi-square value using the training data. The top 10 unigrams and bigrams by mutual information score and chi-square value are shown in Table 4.
We approached IMT as a classification task in which we determine whether an article-ontology node pair is positive or negative. We identified 21 features (listed in Table 5) and scored these features for each article-ontology node pair. We then trained the machine learning classifiers Random Forest [43], Random Committee [44], Naïve Bayes Tree [45] and J48 [46] to predict the label for each article-ontology node pair. All features were uniformly weighted when provided to the machine learning classifiers.
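The pair-classification step can be sketched as follows. The three features below are illustrative stand-ins for the 21 features listed in Table 5 (summed chi-square and mutual information scores of the node's unigrams found in the article, plus a keyword-match flag), and scikit-learn's RandomForestClassifier is used in place of the Weka implementation; all scores and documents are invented:

```python
from sklearn.ensemble import RandomForestClassifier

def pair_features(article_tokens, node):
    """Score one (article, node) pair; `node` holds precomputed term scores."""
    tokens = set(article_tokens)
    chi_sum = sum(s for t, s in node["chi2"].items() if t in tokens)
    mi_sum = sum(s for t, s in node["mi"].items() if t in tokens)
    keyword_hit = 1.0 if node["keyword"] in tokens else 0.0
    return [chi_sum, mi_sum, keyword_hit]

# Hypothetical ontology node with term scores learned from training data
node = {"chi2": {"coimmunoprecipitation": 4.2, "bait": 1.1},
        "mi": {"coimmunoprecipitation": 0.8, "bait": 0.3},
        "keyword": "coimmunoprecipitation"}

X = [pair_features(doc.split(), node) for doc in [
    "lysates were subjected to bait coimmunoprecipitation",
    "the genome was sequenced and assembled",
    "coimmunoprecipitation confirmed the complex",
    "mice were housed under standard conditions",
]]
y = [1, 0, 1, 0]  # gold labels for each article-node pair
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict([pair_features("anti bait coimmunoprecipitation assay".split(), node)]))
```

Because the classifier only sees numeric scores derived from an ontology node's terms, the same setup works for any ontology and text collection.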
We then conducted feature selection using the chi-square measure available in Weka's built-in feature selection module. To identify the best feature set, we tuned the classifiers by training on the training data and testing on the development data. We counted the numbers of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). An article-node pair was considered a true positive if both the gold standard and the classifier identified the pair as positive; a false positive if the classifier, but not the gold standard, considered the pair positive; a false negative if the gold standard, but not the classifier, considered the pair positive; and a true negative if both considered the pair negative. Using these counts, we calculated precision and recall with the following formulae:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

The F1-score was then calculated as the harmonic mean of precision and recall. The F1-score was used as the measure of performance during parameter tuning, by which we obtained the best number of features for each classifier.
For each article, we also identified the evidence sentence from which each interaction method was identified. For this, we calculated a score for each sentence, and the sentence with the highest score was considered to be associated with the interaction method. To calculate the score, we looked for the unigrams of the interaction method's name in each sentence. If a unigram was present in the sentence, its chi-square value was added to the sentence's score; if no unigrams were present, the sentence was assigned a score of 0. If multiple sentences had the same score, the longest sentence was associated with the interaction method.
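The sentence-scoring step can be sketched in a few lines; the chi-square values below are hypothetical placeholders:

```python
def best_evidence_sentence(sentences, method_unigram_chi2):
    """Pick the evidence sentence for one interaction method.

    Each sentence is scored by summing the chi-square values of the
    method-name unigrams it contains; the highest-scoring sentence wins,
    with ties broken by sentence length (longest wins).
    """
    def score(sentence):
        tokens = set(sentence.lower().split())
        return sum(chi2 for term, chi2 in method_unigram_chi2.items()
                   if term in tokens)
    # max() compares (score, length) tuples, so the longest sentence
    # wins whenever scores are equal
    return max(sentences, key=lambda s: (score(s), len(s)))

chi2_values = {"two": 0.9, "hybrid": 3.1}  # hypothetical scores
sentences = [
    "The yeast two hybrid screen identified five partners.",
    "Cells were grown overnight at 30 degrees.",
    "A hybrid approach was also considered.",
]
print(best_evidence_sentence(sentences, chi2_values))
# -> "The yeast two hybrid screen identified five partners."
```

Note that a sentence containing none of the method's unigrams scores 0 and can still be returned if no sentence matches; a real system would likely treat that case as "no evidence found".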
As a result, we developed a framework for IMT that makes use of the features mentioned above and conducts feature selection. The framework, called OntoNorm, is available online at http://sourceforge.net/p/ontonorm/home/. It can be used to train models with any ontology and text collection.
Similar to ACT, each participating team was allowed to submit 10 runs for IMT; five runs could be submitted offline and the other five online, using XML-RPC. Our runs 1-5 were submitted offline and runs 6-10 online; they are listed in Table 6. We trained the classifier on the combination of the training and development data for all runs.

Evaluation
As mentioned earlier, we participated in the BioCreative III challenge for evaluation, and the evaluation of the runs was conducted by the organizers. The micro-averaged F1-score, Matthews Correlation Coefficient (MCC) and AUC iP/R were used as evaluation metrics. The Matthews Correlation Coefficient is a measure of binary classification performance and is based on the chi-square statistic obtained for a 2x2 contingency table. It is measured by the following formula:

MCC = sqrt(chi-square / n)

where n is the total number of observations. The AUC iP/R was measured by drawing the precision/recall curve and interpolating it; the area under the interpolated curve is the AUC iP/R.
For ACT, the accuracy, sensitivity and specificity of the system were also measured. Accuracy is the ratio of correctly classified instances to all instances, sensitivity is the ratio of true positives to all positive instances, and specificity is the ratio of true negatives to all negative instances.
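For illustration, all of these metrics can be computed from a single 2x2 confusion table; the counts below are invented. For a 2x2 table, computing MCC directly from the counts is equivalent (up to sign) to the sqrt(chi-square/n) formulation:

```python
import math

def metrics(tp, fp, fn, tn):
    """Compute the evaluation metrics described above from a 2x2 table."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # recall == sensitivity
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / n
    specificity = tn / (tn + fp)
    # Direct 2x2 form of the Matthews Correlation Coefficient
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "specificity": specificity, "mcc": mcc}

print(metrics(tp=40, fp=10, fn=20, tn=130))
```

With these invented counts, precision is 40/50 = 0.8 and recall is 40/60, giving an F1-score of 8/11 (about 0.727).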

Results
When tuning, we observed that for ACT, the best F1-scores were obtained when using 400 or 1000 features with Naïve Bayes Multinomial (NBM) and 400 features with Support Vector Machines (SVM) (Additional File 1). The top 10 features (unigrams and bigrams) were: 'interact', 'interact with', 'bind', 'protein', 'domain', 'bind to', 'phosphoryl', 'regul', 'complex' and 'activ'. Similarly, for IMT the best results were seen when using 14 features with J48, 7 features with Naïve Bayes Tree, 12 features with Random Forest and 12 features with Random Committee (Additional File 2). The top five features were: 'unigram chi-square sum', 'unigram mutual information sum', 'unigram chi-square sum concept only', 'unigram mutual information sum concept only' and 'bigram mutual information sum'. Our runs were designed to cover all algorithms and different feature combinations. The performance of our ACT and IMT runs is shown in Table 7 and Table 8, respectively. For ACT, we found that the performance of SVM-based classifiers was better than that of NBM-based classifiers, although during tuning the NBM-based classifiers had performed better. For IMT, the results of all classifiers were very close to each other; the F1-score difference between the best and worst runs was less than 2.5 percentage points. We found that runs for which feature selection was done performed better than runs that used all features, indicating that certain features are not useful.

Results in relation to other systems
We compared the performance of our systems with those of the other teams that participated in the ACT and IMT tasks of the BioCreative III challenge.

Discussion
We have developed supervised machine learning frameworks to identify articles that contain protein-protein interaction data and to map ontology nodes to the text of an article. Our goal was to develop these approaches independently of domain knowledge and manual intervention, such that they can be viewed as frameworks applicable to other article classification and ontology mapping tasks. For ACT, our system, Simple Classifier, meets these goals. For IMT, we did modify the ontology by manually adding synonyms and keywords, because of which we cannot claim that OntoNorm is entirely free from manual intervention; however, given an ontology with a comprehensive list of synonyms, this manual intervention would be unnecessary. In this sense, OntoNorm can be used to map terms from any given ontology to any text articles. For ACT, our approach was simpler than those used by many other teams at the BioCreative challenge. Despite this simplicity, our system ranked 2nd amongst 10 teams, and the difference between the performance of the 1st-ranked team and our system was marginal, suggesting that the frameworks we employed in this study are efficient, competitive and robust. Our SVM-based runs obtained poor AUC iP/R despite obtaining good accuracy and F1-scores. This was because, for most instances, the annotation confidence assigned by the classifier was 100%, which prevented the results from being ranked meaningfully. Except for AUC iP/R, SVM-based models performed better than NBM-based models on the test data, although NBM-based models performed better during tuning. This may be because the NBM-based models overfit the training data.
On analyzing incorrectly classified ACT cases, we observed that false positives occurred when an article contained terms that usually indicate protein-protein interaction but were not used in that context; for example, the article with PMID:19694809 uses the keyword 'interaction' but does not describe protein-protein interaction. Conversely, false negatives occurred when such terms were missing even though the article contained protein-protein interaction data, for example, the article with PMID:19724778. This error analysis uncovers one disadvantage of our machine learning framework: it is based only on lexical features, which may not carry sufficient information and can cause ambiguities in some cases. It also suggests that deep linguistic analysis (e.g. syntactic and semantic analysis) might be needed to further enhance the system's performance.
For IMT, we identified several domain-independent features to classify article-node pairs. We believe that the approach works well, as our system placed 3rd amongst 8 teams at BioCreative III. We found that tree-based classifier algorithms such as Random Forest and J48 performed better at this task. Most of our errors occurred when annotating the nodes "anti-tag coimmunoprecipitation" and "anti-bait coimmunoprecipitation", as "coimmunoprecipitation" was usually mentioned in relevant articles, but whether it was anti-tag or anti-bait coimmunoprecipitation was not explicitly stated. For example, article [47] was falsely annotated with "anti-tag coimmunoprecipitation".
We found that unigram-related features ranked higher than bigram-related features in the IMT task, as four of the top five features were unigram based. We speculate that this is because of the high variance in how different interaction methods are discussed in articles, which makes unigram features more reliable than bigrams.

Conclusion
We have developed machine learning frameworks that make use of domain-independent features to classify text (Simple Classifier) and to map nodes in an ontology to text (OntoNorm). These frameworks obtained performance competitive with other participating teams when applied to the tasks of identifying articles that contain protein-protein interaction data and identifying methods from an ontology that were used to study these interactions.
In the future, we may apply our frameworks on other text mining applications. In addition, our current approach for OntoNorm does not make use of the hierarchy of the ontology, which will be investigated and evaluated in the future as well.

Additional material
Additional file 1: ACT tuning data. Results of various combinations of classifier algorithms, feature selection algorithms and numbers of features when trained on the ACT development data and tested on the ACT training data.
Additional file 2: IMT tuning data. Results of various combinations of classifier algorithms, feature selection algorithms and numbers of features when trained on the IMT training data and tested on the IMT development data.