A linear classifier based on entity recognition tools and a statistical approach to method extraction in the protein-protein interaction literature
© Lourenço et al; licensee BioMed Central Ltd. 2011
Published: 3 October 2011
The Erratum to this article has been published in BMC Bioinformatics 2012 13:180
We participated, as Team 81, in the Article Classification and the Interaction Method subtasks (ACT and IMT, respectively) of the Protein-Protein Interaction task of the BioCreative III Challenge. For the ACT, we pursued an extensive testing of available Named Entity Recognition and dictionary tools, and used the most promising ones to extend our Variable Trigonometric Threshold linear classifier. Our main goal was to exploit the power of available named entity recognition and dictionary tools to aid in the classification of documents relevant to Protein-Protein Interaction (PPI). For the IMT, we focused on obtaining evidence in support of the interaction methods used, rather than on tagging the document with the method identifiers. We experimented with a primarily statistical approach, as opposed to employing a deeper natural language processing strategy. In a nutshell, we exploited classifiers, simple pattern matching for potential PPI methods within sentences, and ranking of candidate matches using statistical considerations. Finally, we also studied the benefits of integrating the method extraction approach that we have used for the IMT into the ACT pipeline.
For the ACT, our linear article classifier leads to a ranking and classification performance significantly higher than all the reported submissions to the challenge in terms of Area Under the Interpolated Precision and Recall Curve, Matthews Correlation Coefficient, and F-Score. We observe that the most useful Named Entity Recognition and Dictionary tools for classification of articles relevant to protein-protein interaction are ABNER, NLProt, OSCAR 3 and the PSI-MI ontology. For the IMT, our results are comparable to those of other systems, which took very different approaches. While the performance is not very high, we focus on providing evidence for potential interaction detection methods. A significant majority of the evidence sentences, as evaluated by independent annotators, are relevant to PPI detection methods.
For the ACT, we show that the use of named entity recognition tools leads to a substantial improvement in the ranking and classification of articles relevant to protein-protein interaction. Thus, we show that our substantially expanded linear classifier is a very competitive classifier in this domain. Moreover, this classifier produces interpretable surfaces that can be understood as “rules” for human understanding of the classification. We also provide evidence supporting certain named entity recognition tools as beneficial for protein-interaction article classification, or demonstrating that some of the tools are not beneficial for the task. In terms of the IMT task, in contrast to other participants, our approach focused on identifying sentences that are likely to bear evidence for the application of a PPI detection method, rather than on classifying a document as relevant to a method. As BioCreative III did not perform an evaluation of the evidence provided by the system, we have conducted a separate assessment, where multiple independent annotators manually evaluated the evidence produced by one of our runs. Preliminary results from this experiment are reported here and suggest that the majority of the evaluators agree that our tool is indeed effective in detecting relevant evidence for PPI detection methods. Regarding the integration of both tasks, we note that the time required for running each pipeline is realistic within a curation effort, and that we can, without compromising the quality of the output, reduce the time necessary to extract entities from text for the ACT pipeline by pre-selecting candidate relevant text using the IMT pipeline.
A basic step toward discovering or extracting information about a particular topic in biomedical text is the identification of a set of documents deemed relevant to that topic. Separating relevant from irrelevant documents is an example of document classification. Due to the central role document classification plays in biomedical literature mining, part of the BioCreative (Critical Assessment of Information Extraction systems in Biology) challenge evaluation is the Article Classification Task (ACT). In the last three challenges this task has focused on the classification of articles based on their relevance to Protein-Protein Interaction (PPI).
For the BioCreative challenges 2 (BC2) and 2.5 (BC2.5) we developed the lightweight Variable Trigonometric Threshold (VTT) linear classifier, which employs word-pair textual features and protein counts extracted using the ABNER tool. VTT was one of the top performing classifiers in the abstract classification task of BC2 and the best classification system on the full-text scenario of BC2.5, as tallied by the organizers.
In this BioCreative 3 challenge (BC3), we developed a novel and more general version of VTT which utilizes a number of features obtained via Named Entity Recognition (NER) and dictionary tools. We continue the development of this simple linear classifier since it has performed very well in the real-world scenarios of BioCreative, where training and test data are not guaranteed to be drawn from the same distributions of features; the simple linear decision surface seems to generalize the concept of PPI better than more sophisticated classifiers in this context. We show that by expanding the classifier to handle a substantial increase in the amount of NER data, its performance improves significantly. Another interesting feature of the VTT is the interpretability of its simple decision surface, leading to (linear) “rules” for deciding the relevance of literature to PPI.
Throughout the development of our classifier, we analyzed the applicability of various NER and dictionary tools for deciding PPI-relevance. The assessment of appropriate tools is also described in this article, and offered to the community as a large-scale empirical study. In addition, we examine a few other questions related to the VTT and article classification. First, is there a benefit to using word bigrams as textual features, compared to the smaller set of word-pairs we previously employed [3, 4]? Second, does full-text data (when available) benefit classification? This last question is approached only partially here; as full-text data was not fully provided by BC3, we harvested a full-text subset for those BC3 articles that were available through PubMed Central.
The Interaction Method Task (IMT) at BC3 looked beyond the identification of relevant articles, and posed the challenge of finding evidence within full-text biomedical publications concerning the technique used for identifying protein-protein interaction. The task definition made the point that: "A crucial aspect for the correct annotation of experimentally determined protein interactions is to determine the technique described in the article to support a given interaction… For this task, we will ask participants to provide, for each full text article, a ranked list of interaction detection methods, defined by their corresponding unique concept identifier from the PSI-MI ontology". It also required including, as part of the submission for each Interaction Method, the evidence string derived from the text that supports the decision to associate the method with the article.
We thus interpreted the IMT literally, as the task of finding, within the text, discussion of techniques that were used and that can serve to detect PPIs, rather than that of identifying the PPIs themselves. Consequently, we took the approach of looking within the text for sentences that are likely to form evidence for methods being employed, tagging articles with the (likely) methods found. We then provided, in accordance with the BioCreative IMT output specification, for each article the identifiers of these methods, along with a score indicating how confident our system is in the association between the method and the article. The sentences within the text on which the association was based were provided as evidence.
Almost all teams participating in the BioCreative III IMT challenge regarded the method assignment as an article classification task, in which articles are assigned to one (or more) of the many different PPI methods as categories. In contrast, we have taken a very different route. We focused primarily on identifying potential evidence for the use of methods within the text, and then narrowed the candidate sentences to those that may discuss methods that can be used for PPI detection. Once sentences were found that were likely to bear evidence for the use of a potential PPI method, we scored these sentences with respect to the associated PPI detection method; PPI methods associated with high-scoring sentences were then listed as PPI methods supported by the article, with the high-scoring sentences listed as evidence. Thus, the fundamental difference between our system and the other participating systems is that we focused on identifying evidence for potential use of PPI detection methods, while most other systems focused on classifying documents into method categories, without searching for the explicit evidence.
Moreover, other teams based their work on natural language processing (NLP) to identify a variety of components and named entities, including proteins [6–8] and possibly interactions among them, as a fundamental step prior to method detection. In contrast, we used only simple pattern matching of methods, ranking candidate matches using statistical considerations, without attempting to identify entities. We do believe that NER to identify proteins is likely to improve our system's performance, but, as noted, we have focused on the identification of methods that can be used for identifying PPI, rather than on the PPIs themselves.
Another notable aspect of the IMT and its evaluation is that, while the task definition required associating methods with articles, providing the ranking and the strength of the association as well as the evidence supporting it, the evaluation only measured whether the correct method identifiers were associated with each article, regardless of the strength assigned to this association, and regardless of the evidence. Correctness was determined by comparison of the method identifiers assigned by the system to the method identifiers assigned by human annotators. The evidence, which was requested in the task specification, was not formally evaluated or examined in BC3.
Furthermore, the training data consisted strictly of full text articles along with the PPI detection method tags assigned to the articles by curators, but did not provide any indication or tagging of the evidence within the text supporting this assignment. Similarly, the gold standard released after the challenge does not show this evidence. As such, there is currently no data against which one can evaluate the quality of the evidence produced by the competing systems.
To overcome this shortfall in both the data and the evaluation, immediately following the BioCreative meeting we recruited a team of independent annotators to go over the results produced by one of our runs, and constructed a triply-annotated corpus of over 1000 sentences. The section on the Interaction Methods Task, and its Results subsection, provide further detail about the use of this corpus in our evaluation.
Article Classification Task
We participated in both the online (via the BioCreative MetaServer platform) and the offline components of ACT. We used four distinct versions of the most general VTT linear classifier as presented below. The main goal was to study the effect of using various NER and dictionary tools on classification performance. Therefore, the four versions of the VTT vary in the amount and the type of NER data which they use.
Data and feature extraction
Given a labeled training corpus of documents D, let P refer to the set of documents labeled relevant or positive, and N to the set of documents labeled irrelevant or negative; by definition, D ≡ P ∪ N and P ∩ N ≡ ∅. All documents, d ∈ D, are preprocessed by removal of stopwords and Porter stemming. The stopword list is: i, a, about, an, are, as, at, be, by, for, from, how, in, is, it, of, on, or, that, the, this, to, was, what, when, where, who, will, and, we, were (note that the words “with” and “between” were kept). For training data we used the training and development sets released by BC3 for the ACT, as well as the documents released for IMT, which we labeled as positive. This results in a set of 8315 unique documents (3857 labeled positive, and 4458 labeled negative) defined by their PubMed IDs (PMID). To produce textual features (as described below), we oversampled documents from the positive set to obtain a balanced set where |P| = |N| = 4458, |D| = 8916. By oversampling we mean that we randomly selected positive documents to be repeated in the set P. For textual feature selection, as described below, we used only the title and abstract text associated with the PubMed records of these documents. For NER feature selection (see below), we extracted figure caption text and full text from the subset of public-domain documents with PubMed Central records. We denote the full text subset as: D PMC ⊂ D, where |D PMC | = 4190 (≈50% of D).
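The oversampling step described above can be illustrated with a minimal sketch. It assumes that the extra positive documents are drawn uniformly at random, with replacement, from the original positive set; the text does not specify the sampling scheme beyond random repetition. The function name `oversample_positives` and the placeholder PMIDs are hypothetical.

```python
import random

def oversample_positives(positives, negatives, seed=0):
    """Repeat randomly chosen positive documents until |P| equals |N|.

    Assumption: repeats are sampled uniformly with replacement.
    """
    rng = random.Random(seed)
    balanced = list(positives)              # every original positive is kept
    deficit = max(len(negatives) - len(balanced), 0)
    balanced.extend(rng.choice(positives) for _ in range(deficit))
    return balanced

# Placeholder identifiers standing in for the 3857 positive and 4458 negative PMIDs.
pos = [f"pos_{i}" for i in range(3857)]
neg = [f"neg_{i}" for i in range(4458)]
balanced_pos = oversample_positives(pos, neg)
print(len(balanced_pos), len(balanced_pos) + len(neg))
```

With the set sizes reported above, the balanced positive set has 4458 entries and the balanced corpus 8916.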
Let D t refer to the official BC3 test set of documents, which was unlabeled at the time of the challenge, but whose class labels were subsequently provided to the community as a gold standard. This is a highly unbalanced set, with 5090 negative or irrelevant documents, and 910 positive or relevant documents, for a total of |D t | = 6000 documents. Out of these, we were able to obtain PubMed Central records for documents (60% of D t ); 423 positives and 2596 negatives (preserving a similar proportion of negatives to positives as in the overall test set).
Textual feature selection: word-pair and bigram features
The VTT classifier requires textual features to have been obtained from labeled, training documents. In previous versions of VTT, we have used word-pair features similar to bigrams, but which are less computationally demanding to obtain [3, 4]. Here, because we are interested in investigating the benefit of using our word-pairs compared with bigrams, we have used both types of features in different runs of the classifier.
Single words are ranked by a score S(w), computed for every word w ∈ W D , where W D is the set of all unique words in the training corpus D, after pre-processing and stopword removal; the 1000 top-ranked words form the set W. The score, S(w), measures the difference between the probability of occurrence of a word w in relevant documents, p P (w), and the probability of occurrence in irrelevant ones, p N (w). Each document in the set D is subsequently converted into an ordered list comprised of a subset of these 1000 words, w ∈ W. The list representing each document is ordered (with repetition) according to the sequence in which the words occur in the original text. That is, the original sequence of words in the text is converted into a sequence that contains only words w ∈ W; all words not in the top 1000 set, W, are removed. The top 10 (stemmed) words and their S score in the training data for BC3 were: interact (0.41), protein (0.4), bind (0.33), domain (0.27), complex (0.26), regul (0.24), activ (0.21), here (0.19), phosphoryl (0.16), function (0.15).
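The word ranking and document filtering can be sketched as follows. Two assumptions are made here that the text does not spell out: the probability of occurrence of a word in a class is taken as the fraction of documents of that class containing it, and the difference is taken in absolute value. All function names are illustrative.

```python
from collections import Counter

def s_scores(pos_docs, neg_docs):
    """Score each word by |p_P(w) - p_N(w)| (assumed form of S).

    pos_docs, neg_docs: lists of token lists, already stemmed and
    stripped of stopwords.
    """
    def doc_prob(docs):
        # Fraction of documents in which each word occurs at least once.
        counts = Counter(w for doc in docs for w in set(doc))
        return {w: c / len(docs) for w, c in counts.items()}

    p_pos, p_neg = doc_prob(pos_docs), doc_prob(neg_docs)
    vocab = set(p_pos) | set(p_neg)
    return {w: abs(p_pos.get(w, 0.0) - p_neg.get(w, 0.0)) for w in vocab}

def top_words(scores, k=1000):
    """Return the k highest-scoring words (ties broken alphabetically)."""
    return set(sorted(scores, key=lambda w: (-scores[w], w))[:k])

def filter_document(tokens, keep):
    """Keep only words in the selected set, preserving order and repetition."""
    return [w for w in tokens if w in keep]

pos_docs = [["interact", "protein", "bind"], ["protein", "interact"]]
neg_docs = [["protein", "cell"], ["cell", "gene"]]
scores = s_scores(pos_docs, neg_docs)
W = top_words(scores, k=2)
print(filter_document(["protein", "interact", "with", "cell"], W))
```

In this toy corpus the two retained words are "interact" and "cell", so the example document is reduced to those two words in their original order.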
Top 10 SP features (w i , w j ) ranked with the S score.
Top 10 bigram features (w i , w j ) ranked with the S score.
One side goal of this work was to investigate whether the computational overhead of bigram extraction is worthwhile. Notably, the generation of SP features requires two iterations over each document: one to extract the single word features, and another to obtain the occurrence counts of SP features after ranking of single word features over the entire training corpus. In contrast, bigrams in principle require a single iteration over each document to extract occurrences. However, there are many more unique observable bigrams than unique single word features, due to the possible combinations of single words with one another, whereas the second pass to compute SP features runs not over the entire document text but over the ordered lists containing only the top (1000) single words, which results in a much smaller set of possible word pairs. Therefore, in a large corpus the list of bigrams to store and index for tallying occurrences is much larger than that of SP features, resulting in a substantial computational overhead. One other possible issue is that of finding the optimal number of top scoring words selected to produce SP features. We showed in an earlier publication that the S score histogram can guide us to identify a good threshold number after which no improvement results. We used this technique here.
For simplicity, in the remainder of the article, unless otherwise specified, we refer to textual features simply by the symbol w.
Entity count features: data from entity recognition tools
In our previous work with a simpler version of VTT for BC2, we used as an additional feature the number of proteins mentioned in abstracts, as identified by the NER tool ABNER. More recently, in BC2.5, we used the same additional feature in distinct sections of full text documents, and observed that terms extracted from domain ontologies did not help in article classification. Here, we pursue a much wider investigation of the utility of using terms from NER and dictionary tools available to the community.
What we use for VTT are entity count features: for each document d ∈ D, we compute the number of occurrences n π (d) of each entity type π. An example of an entity type is “protein mentions” as identified by ABNER. Naturally, in the context of BC3, we are interested in the entity count features that can best discriminate documents relevant for PPI (positive) from irrelevant ones (negative). For that purpose, we utilized the NER tools ABNER [2, 10], NLProt and OSCAR 3 [12, 13] and compiled dictionaries from the BRENDA (enzymes) [14, 15] and ChEBI (chemical compounds) databases, as well as the PSI-MI ontology (experimental methods).
With each one of these tools we extracted various types of entity count features in abstracts for all documents, d ∈ D, and also in figure captions and full text of the subset of documents available in PubMed Central, d ∈ D PMC . Examples of entity count features we collected are the number of protein mentions in an abstract identified by NLProt, and the number of PSI-MI method mentions in figure captions.
1. ABNER protein mentions in abstracts
2. NLProt protein mentions in abstracts
3. OSCAR compounds in abstracts
4. ABNER protein mentions in figure captions
5. PSI-MI methods in full texts
Approach: variable trigonometric threshold classifier
where w denotes a textual feature such as SP or bigram as described above. In other words, P(d) sums the cosine contributions of every occurring feature w in document d, when projected on the p P /p N plane. N(d), in turn, sums the sine contributions of every occurring feature w.
The right-hand side expression of Eq. (1) specifies a decision threshold for a document, given its ratio of positive and negative textual feature contributions (on the left-hand side). This decision threshold is defined by a constant, λ0, and a variable component, defined by entity count features. The idea is that information from NER data can alter the decision threshold. For instance, in Figure 2 we can see that 90% of all positive documents in the training data set, D, have 5 or more ABNER-extracted protein mentions, whereas only 40% of negative documents have the same number of mentions. Therefore, when a given document, d, contains more than 5 ABNER-extracted protein mentions, we can expect it to have a higher chance of being relevant. To introduce this type of information into the decision threshold, the VTT classifier is defined for M=|EP-EN| entity count features, EP of which are positively correlated with positive documents (such as ABNER protein mentions), and EN of which are positively correlated with negative documents. For simplification, we refer to the first as positive entity count features, and to the second as negative entity count features.
Each positive entity count feature π adjusts the decision threshold for document d with the factor (β π – n π (d))/β π , where β π is a constant parameter; when n π (d) > β π , the threshold is lowered, and increased otherwise. Each negative entity count feature ν adjusts the decision threshold for document d with the factor (n ν (d) – β ν )/β ν , where β ν is a constant; when n ν (d) > β ν , the threshold is increased, and lowered otherwise. The β parameters represent the neutral threshold point for the respective entity count feature: when n π (d) = β π , there is no threshold adjustment from information about entity count feature π.
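The effect of entity counts on the threshold can be illustrated with a short sketch. It assumes that the adjustment factors are simply added to the constant λ0; the feature name `abner_abstract` and the function `vtt_threshold` are hypothetical and serve only to make the arithmetic concrete.

```python
def vtt_threshold(lambda0, counts, betas, positive=(), negative=()):
    """Decision threshold for one document.

    Assumption: each entity count feature contributes an additive term.
    counts[f] is n_f(d); betas[f] is the neutral point for feature f.
    """
    threshold = lambda0
    for f in positive:
        # More mentions than beta lowers the threshold.
        threshold += (betas[f] - counts[f]) / betas[f]
    for f in negative:
        # More mentions than beta raises the threshold.
        threshold += (counts[f] - betas[f]) / betas[f]
    return threshold

betas = {"abner_abstract": 5}
for n in (0, 5, 10):
    t = vtt_threshold(1.0, {"abner_abstract": n}, betas, positive=["abner_abstract"])
    print(n, t)
```

With λ0 = 1 and β = 5, a document with 5 protein mentions leaves the threshold at 1, a document with 10 mentions lowers it to 0, and a document with none raises it to 2.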
In addition to the class decision, computed by the VTT decision surface (Eq. (4)), we ranked positive documents by decreasing value of C (Eq. (5)), followed by negative documents ranked by increasing value of C.
Another interesting feature of the plot is the easy identification of the point of no threshold adjustment. When n π (d) = β π and n ν (d) = β ν for all π and ν entity count features, y(d) = 1 ⇔ T(d) = λ – 1 (see Figure 4, right). This means that NER information is neutral and the decision (x(d) > λ – 1) is exclusively made by the value of x(d) computed from textual features via Eq. (3).
Notice that the value of x(d) in Eq. (3) is undefined if N(d) = 0. Therefore, if P(d) = N(d) = 0, which means there is no information from textual features about document d (no textual feature occurs in d), we set x(d) = λ – 1, so that the decision is made exclusively by NER information. Additionally, if P(d) > 0 ∧ N(d) = 0, we set x(d) = (λ – 1) · P(d), so that the decision uses NER information as well as the contributions from textual features toward a positive decision.
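The special cases for N(d) = 0 can be written as a small guard around the ratio. This sketch assumes that x(d) is the ratio P(d)/N(d) whenever N(d) > 0, and reads the last fallback as the product of (λ – 1) and P(d); the function name is illustrative.

```python
def textual_ratio(p, n, lam):
    """x(d) with the fallbacks used when the sine sum N(d) is zero.

    p, n: summed positive (cosine) and negative (sine) contributions.
    lam:  the constant lambda of the decision surface.
    """
    if n > 0:
        return p / n              # assumed regular case
    if p == 0:
        return lam - 1            # no textual evidence at all
    return (lam - 1) * p          # only positive textual evidence

print(textual_ratio(2.0, 4.0, 3.0))
print(textual_ratio(0.0, 0.0, 3.0))
print(textual_ratio(2.0, 0.0, 3.0))
```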
Experimental setting: training and submissions
where TP, TN, FP, FN denote true positives, true negatives, false positives, and false negatives, respectively.
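For reference, two of the performance measures named in this article, Matthews Correlation Coefficient and F-Score, can be computed from these four counts using their standard definitions. The sketch below returns 0 when a denominator vanishes, which is a common convention and an assumption here.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def f_score(tp, fp, fn):
    """Balanced F-Score (harmonic mean of precision and recall)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

print(mcc(50, 50, 0, 0), f_score(50, 0, 0))      # perfect classifier
print(mcc(25, 25, 25, 25), f_score(1, 1, 1))     # chance-level MCC
```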
The search is performed as follows:
1. Set all β π to the values that maximize |p P (n π (d) ≥ x) - p N (n π (d) ≥ x)|, as observed in entity count feature charts (see above). Same for β v .
2. Search λ, β π , β v space with coarser steps around values set in 1. Search λ widely.
3. Collect the most common values of λ, β π , β v in the top echelon of classifier parameter sets obtained by the rank product of performance measures. All classifiers in the top echelon have the same value of rank product.
4. Search more finely around values obtained in 3.
This search procedure rewards not only the top performing classifiers, but also those parameter ranges whose performance is robust to small changes in the other parameters. This is achieved in step 3 of the search procedure, when we select the most common values of parameters in the (initially large) set of top performing classifiers. Because VTT is very simple to compute, the search can be done in a fairly exhaustive manner, depending on the number of parameters needed for entity count features. We provide Excel worksheet demos of the VTT surfaces and parameter search code in supplementary materials at http://cnets.indiana.edu/groups/casci/piare. These simple demos are capable of searching the entire space of BC3 data, which highlights how computationally simple the classifier is.
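Steps 1 and 3 of the search can be sketched as follows. The first function picks, for one entity count feature, the value x that maximizes the gap between the fractions of positive and negative documents with at least x mentions. The second computes a rank product over several performance measures; treating tied values as sharing a rank is an assumption, as are all names used here.

```python
def initial_beta(pos_counts, neg_counts):
    """Step 1: x maximizing |p_P(n >= x) - p_N(n >= x)| over observed counts."""
    def frac_at_least(counts, x):
        return sum(c >= x for c in counts) / len(counts)

    candidates = sorted(set(pos_counts) | set(neg_counts))
    return max(
        candidates,
        key=lambda x: abs(frac_at_least(pos_counts, x) - frac_at_least(neg_counts, x)),
    )

def rank_product(results):
    """Step 3: product of per-measure ranks (1 = best, higher score is better).

    results: {parameter_set_name: (measure_1, measure_2, ...)}
    """
    n_measures = len(next(iter(results.values())))
    product = {name: 1 for name in results}
    for m in range(n_measures):
        ordered = sorted({scores[m] for scores in results.values()}, reverse=True)
        for name, scores in results.items():
            product[name] *= ordered.index(scores[m]) + 1   # ties share a rank
    return product

def top_echelon(product):
    """Parameter sets sharing the best (lowest) rank product."""
    best = min(product.values())
    return sorted(name for name, value in product.items() if value == best)

print(initial_beta([5, 6, 7, 8], [0, 1, 2, 6]))
rp = rank_product({"a": (0.9, 0.8), "b": (0.8, 0.9), "c": (0.7, 0.7)})
print(rp, top_echelon(rp))
```

In the toy example, parameter sets "a" and "b" each win on one measure and share the same rank product, so both belong to the top echelon from which the most common parameter values would be collected.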
We set out to investigate (1) if additional NER information can improve PPI article classification, (2) if there is a performance cost to using SP instead of bigram word-pair features, and (3) if the addition of full text information improves classification. To answer these questions, we submitted different versions of the VTT algorithm described below.
No NER information, VTT0
The decision is solely defined by the sums of the (cosine and sine) contributions from the textual features for document d. We submitted two variations of this classifier: one computed with SP features and the other with bigrams. Since this VTT version only uses textual features extracted from titles and abstracts, these two classifiers do not use any data from the full-text documents in D PMC (see feature extraction above).
ABNER Protein mentions in abstracts, VTT1
where β and n(d) refer to ABNER protein mentions in abstracts and λ is a constant. The initial value of β for the search algorithm (training) above is chosen as the value that maximizes the difference of occurrence probabilities of this entity count feature between the positive and the negative documents, as depicted in Figure 2: β=5. We submitted two variations of this classifier: one computed with SP features and the other with bigrams. These two classifiers also do not use any data from the full-text documents in D PMC .
With all NER data, VTT5
where λ is a constant. Notice that because entity features 4 and 5 are extracted from full text documents, for a substantial number of documents these features do not exist in our dataset. To account for that, when a document d does not have full text (d ∉ D PMC ): n4(d) = β4 and n5(d) = β5, i.e. for these documents, the VTT classifier assumes the point of neutral NER information for entity features 4 and 5. The initial values of β1, β2, β3, β4 and β5 for the search algorithm (training) were obtained by inspection of the charts in Figures 2 and 3, and are set to 5, 10, 15, 5, and 40, respectively. We submitted two variations of this classifier: one computed with SP features and the other with bigrams.
With NER from abstracts only, VTT3
Parameter values for submitted classifiers after parameter search.
Performance of submitted classifiers on training data.
Performance of submitted classifiers on test data.
From our analysis of NER and dictionary tools, we identified publicly available resources that benefit the classification of PPI-relevant documents. Based on this analysis we selected 5 entity count features, the behavior of which for PPI classification is presented in Figures 2 and 3. Similar charts for all tools and features tested are provided in supplementary materials, including those for rejected tools. Knowledge about the behavior of these tools for PPI article classification is one of the contributions of this work.
During the challenge, our system (both online and offline) was severely hindered by various software and integration errors. The errors included: overwritten values of the entity count features in our database, which effectively randomized the values of these features for the test set documents; an error in the computation of the confidence measure given by Eq. (5), which tended to return the same value for most documents in the test set; and an error in the classification surface of VTT leading to many incorrect class labels.
The various versions of the VTT classifier described above were submitted as different runs, but none of them carried the correct class labels and confidence values. Therefore, the official BC3 results for our system are not only very low, but have no value with respect to the questions we set out to answer. After the challenge, we corrected all errors and computed new performance measures using the BC3 evaluation script and gold standard. Naturally, we trained the corrected classifiers without using any information from the gold standard. Demos are provided with our training (and parameter search) procedure in supplementary materials, to allow our results to be reproduced.
Central tendency and variation of the performance of all runs submitted to ACT on the official BC3 gold standard, including our original and our corrected runs.
Mean + 95% CI
Performance of top 10 reported runs to ACT in BC3.
Interaction Methods Task
Approach and tools
Identifying method sentences
To find candidate evidence passages in text, we used classifiers developed and reported in an earlier work by Shatkay et al., which were trained on a corpus, unrelated to protein-protein interactions, of 10,000 sentences taken from full-text biomedical articles and tagged at the sentence-fragment level. Each sentence in that corpus was tagged by three independent biomedical annotators, along five dimensions: focus (methodological, scientific or generic), type of evidence (experimental, reference, and a few other types), level of confidence (on a scale from 0, no confidence, to 3, absolute certainty), polarity (affirmative or negative statement), and direction (e.g. up-regulation vs. down-regulation), as described in an earlier publication. The corpus itself is publicly available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2678295/bin/pcbi.1000391.s002.zip
While the corpus had little or nothing to do with protein-protein interaction, the Support Vector Machine (SVM) classifier (implemented using LibSVM), trained along the Focus dimension, showed high specificity (95%), sensitivity (86%) and overall F-measure (91%) in identifying Methods sentences. As such, we have used it without any retraining.
Using the converted text files provided by BioCreative, we broke the text into sentences (using the Lingua-EN-Sentence Perl module), and eliminated bibliographic references employing simple rules. Namely, in articles that contained a Reference heading, sentences following the heading were removed; when the Reference heading was absent, regular expressions (based on simple patterns for identifying lists of authors, and publication dates) were used to remove likely references. The remaining sentences were represented as term vectors (as described in an earlier work) and classified according to their focus, utilizing the SVM classifier as mentioned above, thus identifying candidate sentences that are likely to discuss methods. While we also experimented with the classifiers trained to tag text along the other dimensions, almost all sentences were of affirmative polarity and high confidence; we therefore decided to use only the Focus classifier, specifically whether or not a sentence was classified as a Method sentence.
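The two reference-removal rules can be illustrated with a rough sketch. The actual patterns used are not given in the text, so both regular expressions below are hypothetical stand-ins: one for a Reference heading on a line of its own, and one for a line that opens with an author list and contains a four-digit year.

```python
import re

# Hypothetical heading pattern: a line consisting only of "Reference(s)".
HEADING = re.compile(r"^\s*references?\s*$", re.IGNORECASE)
# Hypothetical citation pattern: "Surname AB, Surname C ..." followed by a year.
CITATION = re.compile(r"^(?:[A-Z][a-z]+ [A-Z]{1,3},? ?)+.*\b(?:19|20)\d{2}\b")

def strip_references(lines):
    """Drop everything after a Reference heading; otherwise drop citation-like lines."""
    for i, line in enumerate(lines):
        if HEADING.match(line):
            return lines[:i]
    return [line for line in lines if not CITATION.match(line)]

with_heading = ["We used pull-down assays.", "References", "Smith J, Doe A (2009) Title."]
without_heading = ["We used pull-down assays in 2009.", "Smith J, Doe AB. Some title. 2009"]
print(strip_references(with_heading))
print(strip_references(without_heading))
```

A pattern this simple will misfire on some ordinary sentences; it is meant only to show the shape of the rule, not its tuning.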
The Methods Identifiers (MI) dictionary
In order to associate the actual method identifiers with the classified sentences, we used dictionary-based pattern-matching against PSI-MI ontology terms. To construct the dictionary, we obtained all the PSI-MI terms listed under the “Interaction Method” (MI:0001) branch of the ontology using the Perl module OBO::Parser::OBOParser. The individual words within all the terms, both in the text and in the dictionary, were all stemmed using the Perl module Lingua::Stem that implements the Porter stemmer. Stemming was applied because our early experiments, without stemming, showed inferior results (data not shown). The dictionary was extended to include individual (stemmed) words occurring within the PSI-MI terms, as well as bi-grams and tri-grams of individual words occurring consecutively within the terms, produced using the Perl module Text::NGramize. Words that are hyphen-separated within PSI-MI were included in the dictionary twice, using two forms: one in which the hyphens are replaced by spaces (thus separating the words), and another in which the hyphen is removed and the words are treated as one single composite word. The two forms allow matches against free text in which the same composition appears either completely un-hyphenated (space delimited) or collapsed into one word.
Two special cases emerged from the training set and received special treatment: (i) the tool pdftotxt, used by BioCreative to convert articles into plain text, consistently converted the word "fluorescence" into "orescence"; to correct for that we introduced the term orescence into the dictionary, as a synonym for the term fluorescence microscopy (MI:0416); (ii) similarly, we added the synonyms “anti tag immunoprecipitation” and “anti bait immunoprecipitation” for “anti tag co immunoprecipitation” (MI:0007) and “anti bait co immunoprecipitation” (MI:0006) respectively. These two methods are by far the most common methods identified in the training set (over 700 assignments of each, as opposed to about 480 assignments of the next most popular method, MI:0096, pull-down). This addition ensures that occurrences within the text of the terms "anti tag immunoprecipitation" and "anti bait immunoprecipitation" constitute an exact match to MI:0007 or MI:0006 respectively, rather than an erroneous exact match to the more generic method "immunoprecipitation" MI:0019.
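The dictionary expansion can be sketched in a few lines. This illustration covers only the n-gram and hyphen handling; stemming is omitted, and the function names are hypothetical rather than taken from the Perl modules named above.

```python
def ngrams(words, n):
    """Consecutive n-word sequences within a term."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def hyphen_forms(term):
    """A hyphenated term yields a space-separated form and a collapsed form."""
    if "-" not in term:
        return {term}
    return {term.replace("-", " "), term.replace("-", "")}

def dictionary_entries(term):
    """Full term plus its single words, bi-grams and tri-grams, for each hyphen form."""
    entries = set()
    for form in hyphen_forms(term.lower()):
        words = form.split()
        entries.add(form)
        for n in (1, 2, 3):
            entries.update(ngrams(words, n))
    return entries

print(sorted(dictionary_entries("pull-down")))
print(sorted(dictionary_entries("fluorescence resonance energy transfer")))
```

For "pull-down" this produces the entries "pull down", "pulldown", "pull" and "down", so free text using either the spaced or the collapsed spelling can be matched.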
We note that while the dictionary above is based on the whole PSI-MI ontology, our final reported results consider only sentences that match terms from the reduced list of Molecular Interactions identifiers provided by BioCreative, at http://www.biocreative.org/media/store/files/2010/BC3_IMT_Training.tar.gz.
Matching against the dictionary
Pattern matching of text against the dictionary entries was implemented using the Perl rewrite system Text::RewriteRules. The system was customized to support both full and partial matches; to avoid a large number of spurious matches, it was adjusted to prefer longer matches over shorter ones, and perfect matches over partial ones. The Perl module Lingua::StopWords was used to avoid the matching of common English words. Sentences in which matches to the dictionary were identified were then scored as described next.
As discussed above, each sentence was tentatively associated with all the MIs whose terms (partially) matched the sentence. Statistical considerations were then used to post-process the tentative matches. When multiple MIs hit the same sentence overlapping the same word, a single MI had to be selected; similarly, a single sentence was selected as evidence for each matched MI.
We assigned a score to each sentence that was matched by an MI, based on several statistical considerations involved in associating an MI with a sentence and on the Focus label assigned to the sentence, as described in the first part of this section. We first calculated an un-normalized score, which is a positive number that can be greater than 1. We normalized all scores to be between 0 and 1 as a final step.
The first component, $\mathrm{MIScore}(S_i, MI_j)$, is calculated based on several counts indicating how strong the association of the method identifier $MI_j$ is with the sentence $S_i$. This score is proportional to the length of the matched portion of the synonym for the MI within the sentence, measured both in characters and in words; the score is inversely proportional to the likelihood of the MI to match a sentence by chance, based on the frequency with which words from the MI synonyms occur in the dataset. To formally define the MIScore, we denote by $\mathrm{Hit}(S_i, MI_j)$ the (partial) match of any synonym of the method $MI_j$ within sentence $S_i$, and by $|\mathrm{Hit}(S_i, MI_j)|$ the number of characters within $S_i$ that actually matched the synonym. The MIScore itself is then calculated as the sum of three summands, Score1, Score2, and Score3.
In Score1, the number of times the method identifier $MI_j$ matches a sentence in article $d$ counts all (full or partial) matches by any synonym included in the dictionary entry for $MI_j$, and $|D|$ denotes the total number of articles in the article set $D$. The logarithm and the multiplication by 0.5 are employed to put Score1 in the same numerical range and order of magnitude as Score2 and Score3.
When multiple candidate MIs match a sentence while sharing some of the same words in their match, the MI who has the largest number of matched words is retained as a candidate match for the sentence. In case of a tie between two possible MIs with the same number of matching words, the MI with the longest match as measured in characters (rather than in words) is retained.
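The selection rule for overlapping candidates can be sketched as below. This is a minimal Python illustration under stated assumptions: the `Candidate` record and the `pick_candidate` function are hypothetical names, and the word and character counts in the example are illustrative rather than taken from the study.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    mi: str              # method identifier, e.g. "MI:0019"
    matched_words: int   # number of words matched in the sentence
    matched_chars: int   # number of characters matched in the sentence


def pick_candidate(candidates):
    """Keep the MI with the most matched words; break ties by characters."""
    return max(candidates, key=lambda c: (c.matched_words, c.matched_chars))


# Two MIs overlap on the word "immunoprecipitation" (illustrative counts).
overlapping = [
    Candidate("MI:0019", matched_words=1, matched_chars=19),
    Candidate("MI:0007", matched_words=3, matched_chars=28),
]
winner = pick_candidate(overlapping)  # the three-word match is retained
```

Comparing on the tuple (words, characters) implements both the primary criterion and the tie-break in a single ordering.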
Finally, the evidence for a specific method $MI_j$, denoted as $Ev(MI_j)$, within an article $d$, is the sentence $S_i$ for which the raw score, $RScore(S_i, MI_j)$, is the highest among all sentences within the article in which a partial match was found for a synonym of the method $MI_j$. Formally, for an article $d$ and a method identifier $MI_j$, the evidence for $MI_j$ in $d$ is

$$Ev(MI_j) = \arg\max_{S_i \in d} RScore(S_i, MI_j),$$

where the maximum is taken over the sentences of $d$ that contain a (partial) match to a synonym of $MI_j$. The score of this evidence is the RScore of the sentence that maximizes the expression on the right-hand side.
The raw score, RScore, is un-normalized, and as such is a positive number not necessarily in the range [0, 1] as required by BioCreative. The raw scores are normalized per article, by dividing each raw score by the maximum raw score assigned to any pair of method identifier and sentence within the article. The latter step guarantees that the normalized score is always at most 1.
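Evidence selection and per-article normalization, as described above, can be sketched as follows. This is a minimal Python illustration; the function name `select_evidence`, the dictionary layout, and the example scores are assumptions made for the sketch, not details of the original Perl implementation.

```python
def select_evidence(raw_scores):
    """raw_scores maps (sentence_id, mi) -> raw score for one article.
    Returns mi -> (sentence_id, normalized score) for the best sentence."""
    if not raw_scores:
        return {}
    best = {}
    for (sentence_id, mi), score in raw_scores.items():
        # Keep, for each MI, the sentence with the highest raw score.
        if mi not in best or score > best[mi][1]:
            best[mi] = (sentence_id, score)
    # Normalize by the maximum raw score over all (MI, sentence) pairs.
    top = max(raw_scores.values())
    return {mi: (sid, score / top) for mi, (sid, score) in best.items()}


# Illustrative raw scores for a single article.
article_scores = {
    (1, "MI:0096"): 6.0,
    (2, "MI:0096"): 8.0,
    (2, "MI:0114"): 4.0,
}
evidence = select_evidence(article_scores)
```

Here sentence 2 is selected as evidence for both methods, with normalized scores 1.0 and 0.5; dividing by the article-wide maximum keeps every normalized score at or below 1.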
The same matching and scoring algorithms were used for all runs, both those submitted to BC3 and those described here, which were produced after the workshop. The runs differ only in the threshold applied to the raw scores of evidence per method, which determines whether an MI is included in or excluded from the submitted results report.
In the five runs submitted (results provided in Table S.1 of the supplementary material), Run 1 included the top 40 results for each document, while Run 5 included only methods and evidence with a raw score above 4.5 (before normalization). Unfortunately, the official runs submitted to BC3 were all produced using erroneous code, which mis-executed the pattern matching step against the dictionary and missed many valid matches. After the official submission, the errors in the code were corrected, and thus the runs and the results have changed. As such, we do not provide further details on the official runs aside from reporting the official results in Table S.1, as these runs reflect a computation error rather than a methodological aspect.
The results provided in Tables 8 and 9 include four runs: the first was produced without any filtering, reporting all methods that partly matched each article, giving rise to a very high recall and low precision; the second reports the top 40 scoring MIs for each article; the third reports only MIs whose raw score was at least 6; and the fourth reports only MIs whose raw score was at least 7. As expected, and as seen in Tables 8 and 9, the recall decreases while the precision increases with each consecutive run among these four.
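The four runs differ only in how the scored methods are filtered, which can be sketched as below. This is a minimal Python illustration; `filter_run` and the example scores are hypothetical, and the threshold is treated as inclusive here, which is an assumption.

```python
def filter_run(evidence, top_k=None, min_raw_score=None):
    """evidence: list of (mi, raw_score) pairs for one article.
    Returns the pairs sorted by decreasing raw score, after filtering."""
    ranked = sorted(evidence, key=lambda pair: pair[1], reverse=True)
    if min_raw_score is not None:
        # Assumed inclusive threshold on the un-normalized score.
        ranked = [pair for pair in ranked if pair[1] >= min_raw_score]
    if top_k is not None:
        ranked = ranked[:top_k]
    return ranked


# Illustrative scored methods for one article.
scored = [("MI:0096", 7.2), ("MI:0114", 6.1), ("MI:0416", 3.4)]

run_all = filter_run(scored)                     # no filtering
run_top40 = filter_run(scored, top_k=40)         # top 40 per article
run_6 = filter_run(scored, min_raw_score=6)      # raw score of at least 6
run_7 = filter_run(scored, min_raw_score=7)      # raw score of at least 7
```

Each stricter setting returns a subset of the previous one, which is why recall can only decrease across the four runs while precision tends to rise.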
Independent Evaluation of the Results by Human Annotators
Our approach focused primarily on obtaining evidence for PPI-detection methods within the text, and the BioCreative evaluation did not score this required evidence. To examine the quality of the evidence produced by our system, we therefore recruited a group of five independent annotators, all holding academic degrees in Biology and studying toward advanced degrees (MSc or PhD) in Molecular Biology, all proficient in the English language, and all experienced in reading and using scientific literature, particularly in areas within proteomics.
The annotators were given all the sentences produced as evidence by our system in one of our runs (the run corresponding to the third row in Table 9), a set consisting of 1049 sentences. Each sentence was independently labeled by three different annotators, each assigning one of three possible letter-tags to the sentence, indicating whether/how the sentence relates to methods for detecting protein-protein interaction (PPI). The tags were defined as follows:
Y - if the sentence discusses a method which can potentially be applied for detecting protein-protein interaction.
M - if the sentence discusses a method, but the method is absolutely NOT a protein-protein interaction detection method.
N - if the sentence does not discuss a method whatsoever.
When annotators assigned the label "Y", they also had to assign a numeric label, indicating the actual protein-protein interaction content of the sentence, as follows:
2 - If protein-protein interaction (PPI) is directly and explicitly mentioned within the sentence, along with the method of detection.
1 - If PPI is implied in the sentence, along with the method of detection, but the PPI is not explicitly stated.
0 - If PPI is neither implied nor mentioned in the sentence.
The sentence in the last case is not about PPI. That is, the sentence talks about a method; the method – to the best of the annotator's knowledge – has the potential to detect PPI, and hence labelled Y in the first place; but the sentence does not indicate that the method was actually applied to detecting PPI.
The inter-annotator agreement was high: all three annotators assigned the exact same letter-tag to 65% of the sentences (a rate much higher than the 11% expected by chance when three people each choose one of three possible labels), and at least two annotators agreed on the letter-tag for over 98% of the sentences. That is, only 17 of the 1049 sentences showed a three-way disagreement in tag assignment, far fewer than expected by chance (about 220 sentences with total disagreement when labelling about 1000 sentences using 3 labels). The above details are provided to clarify the major characteristics of the corpus and the reliability of the annotators. Further details about this annotation effort, the corpus, and its potential utility are beyond the scope of this paper and will be provided in a separate publication in the near future.
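The chance-level figures quoted above can be checked by enumerating all label triples. The sketch below assumes that, under chance, each annotator picks one of the three labels uniformly and independently; that assumption is ours and is the simplest reading of "expected by chance".

```python
from fractions import Fraction
from itertools import product

labels = "YMN"
outcomes = list(product(labels, repeat=3))  # 27 equally likely label triples

# Probability that all three annotators pick the same label.
all_agree = Fraction(sum(len(set(o)) == 1 for o in outcomes), len(outcomes))
# Probability that all three annotators pick different labels.
all_differ = Fraction(sum(len(set(o)) == 3 for o in outcomes), len(outcomes))

print(float(all_agree))          # 1/9, about 0.11
print(float(all_differ) * 1000)  # 2/9 of 1000, about 222 sentences
print(1 - 17 / 1049)             # observed share with at least two agreeing, about 0.984
```

The enumeration gives 1/9 for full agreement and 2/9 for full disagreement, in line with the 11% and roughly 220-per-1000 figures in the text.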
We have submitted five official runs to BC3, all using the same basic strategy, varying only in the threshold of the scores applied to the data, and thus in the stringency of the filtering process. Therefore, the runs range from those favouring recall to those favouring precision. As mentioned above, the official submitted runs were produced by a version of our code that contained errors, and the resulting values were very low, both in terms of precision and in terms of recall, as well as by any other measurement. While we provide the results of these runs for the sake of completeness in the supplementary material (Table S.1), they carry no value in terms of evaluating the method described here in-and-of itself.
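Table 8. IMT runs on the training set (after code correction). [Table body not recovered; surviving column header: Total Docs Evaluated.]
Table 9. IMT runs on the test set (after code correction). [Table body not recovered; surviving column header: Total Docs Evaluated.]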
In both tables, the first row, labelled All, contains the results for a run in which every PPI detection method that had any synonym partially matched in any sentence was reported as a PPI detection method relevant to the article. This run obviously has a very high recall at the cost of a very low precision. The next row (Top 40) shows the results from a run in which the forty top scoring MIs in each article are reported. The next two rows in both tables report results of runs in which the criterion for including MIs was stricter, requiring an un-normalized score, RScore, of at least 6 (run 3) or at least 7 (run 4).
Table 10. Summary of evaluation by three human annotators, over 1049 evidence sentences for PPI methods. [Table body not recovered; surviving column headers: # of sentences tagged by the Majority as Label; % of sentences tagged by the Majority as Label.]
Integrating the IMT system into the ACT pipeline
We also experimented with using the output of the IMT in support of the ACT pipeline. Since our IMT system is focused on obtaining evidence for the interaction methods used, we investigated what happens to the entity count features when we crop the original document and keep only the evidence text extracted by the IMT system. That is, the entity recognition is performed not on the original text, but on the evidence portions that the IMT system outputs. We performed the same analysis of entity count features on the IMT-cropped training data. Specifically, we identified those entity count features for which $|p_P(n_\pi(d) \ge x) - p_N(n_\pi(d) \ge x)| \ge 0.3$ (see entity count feature section).
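The selection criterion above compares, for a given entity count feature, the proportion of positive and of negative documents with at least x mentions. A minimal Python sketch follows; the function names, the example counts, and the choice to accept a feature when the gap reaches 0.3 for at least one threshold x are assumptions made for illustration, since the exact procedure is defined in the entity count feature section.

```python
def tail_prob(counts, x):
    """Fraction of documents whose entity count is at least x."""
    return sum(c >= x for c in counts) / len(counts)


def is_discriminative(pos_counts, neg_counts, thresholds, min_gap=0.3):
    """True if |p_P(n >= x) - p_N(n >= x)| >= min_gap for some x."""
    return any(
        abs(tail_prob(pos_counts, x) - tail_prob(neg_counts, x)) >= min_gap
        for x in thresholds
    )


# Illustrative per-document mention counts for one feature.
positive_docs = [5, 6, 7, 8]
negative_docs = [0, 1, 2, 6]
keep_feature = is_discriminative(positive_docs, negative_docs, range(1, 9))
```

In this toy example every positive document has at least 5 mentions while only one in four negative documents does, so the feature would be retained.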
Since the IMT-cropped data contains substantially less text than the original documents, the processing time for NER and dictionary tools on the training and the test data is considerably reduced. The mean number of words per full-text article within the BioCreative corpus is 5,295.8 (Std. Dev. 1,878.6), whereas the mean number of words for an IMT-cropped document is 180.0 (Std. Dev. 161.9). For tools such as NLProt and OSCAR, this represents a more than 10-fold reduction in processing time (see supplementary material). Moreover, we observed that the characteristics of the entity count features are conserved in the IMT-cropped training data: the same 5 features emerge as positively correlated with positive documents (relevant charts are provided in the supplementary material).
This result is significant because it can save considerable computation time in future implementations of our pipeline within a curation effort.
Discussion and conclusion
The Article Classification Task
The VTT5 classifier resulted in a ranking and classification performance substantially higher than all the reported submissions to the BC3 challenge, in terms of AUCiP/R, MCC, and F-Score (see results above). To address the questions raised in the beginning of this paper, we now consider the differences between the various versions of VTT. Clearly, adding the NER information improves PPI article classification. Not only is the VTT5 method quite competitive when compared with all the submissions to BC3, but we can also quantify the improvement in VTT performance by comparing the various versions of the method in Table 5. The AUCiP/R of VTT5, with SP features, is 0.1937 higher than that of VTT1, which is in turn 0.0467 higher than that of VTT0. To gauge the significance of this improvement vis-à-vis the variation in performance of all classifiers submitted to BC3, consider that the standard error and the 95% confidence interval of the mean AUCiP/R are 0.02 and 0.04, respectively (see Table 6). In relative terms, including ABNER protein mentions in abstracts alone leads to a gain of almost 9.5%, and including the additional 4 entity count features leads to a further gain of 35.9% in the AUCiP/R measure. Therefore, the inclusion of several entity count features in VTT significantly improved the ranking ability of the classifier, which is what AUCiP/R primarily measures. The inclusion of NER information also substantially improved the classification ability of VTT as measured by Accuracy (VTT0→VTT1: 1.4% and VTT1→VTT5: 5.2%), F-Score (VTT0→VTT1: 5% and VTT1→VTT5: 14.4%), and MCC (VTT0→VTT1: 7.6% and VTT1→VTT5: 20.1%), the latter being the measure best suited for unbalanced scenarios. The performance of each version of the VTT, as reported in Table 5, can be contrasted with the central tendency and variation of the performance of all classifiers in Table 6.
The improvement in terms of the rank product for all submissions to the ACT is also worthy of notice: out of 58 runs, VTT0 was the 38th best classifier, VTT1 was the 24th best, and VTT5 was the best classifier. According to every performance measure, the largest improvement comes from including all of the entity count features. Therefore, there was much to gain by adding information from NLProt, PSI-MI, and OSCAR in addition to information from ABNER.
Regarding the textual features used, it is also quite clear from our results that using bigram textual features leads to worse performance than using the computationally less demanding SP features. We can see in Table 5 that for every version of VTT used, the SP features always outperformed bigrams for the AUCiP/R, F1, and MCC measures. The exception is when it comes to the Accuracy obtained for VTT0 and VTT1; in these cases, the accuracy was larger when using bigrams. But since accuracy is not as informative in unbalanced scenarios, and because the accuracy of the top performing VTT5 classifier was larger when using SP features, we can conclude that SP features lead to a better performance than bigrams. This suggests that SP features, by using only constituent words with high S score (see textual feature selection section), generalize the concept of PPI more effectively than bigrams. We conclude that not only is the use of the small set of SP features much more computationally efficient, it also leads to better performance of the VTT classifier.
Since two of the entity count features used in the best VTT classifier are derived from full-text data when available (via PubMed Central), i.e. based on ABNER protein mentions in figure captions (feature 4) and on PSI-MI methods in the full text (feature 5), we can conclude that full text is at least partially responsible for the excellent performance reached by this classifier. However, as full-text data was only available for 60% of the documents in the test set (see data and feature extraction section), it cannot be fully responsible for the performance improvement. To further examine this point, we computed a version of the classifier, VTT3, that does not utilize these two entity features. While the performance of VTT3 on the training data is just slightly lower than that of VTT5 (see Table 4), on test data it is noticeably lower (see Table 5). We observe that inclusion of the full-text features leads to approximately a 3% improvement in all performance measures. In comparison to all reported classifiers, VTT3 is below the top two classifiers reported by team 73 (led by W. John Wilbur at NCBI, Runs 2 and 4) as well as both the SP and bigram versions of VTT5. Therefore, we conclude that the inclusion of data from full-text documents, even if available for little more than half of the documents in the training and test corpora, was useful and indeed contributed to obtaining the top reported classification and ranking system.
Besides its very competitive performance, the VTT classifier (in all versions tested) is defined by a simple linear surface that can be interpreted. Indeed, we can look at the parameters of Table 3 (obtained via the training algorithm) and discern a “rule” of what constitutes a PPI-relevant document. We only uncovered 5 entity features positively correlated with positive documents (see entity count features section), therefore confidence in PPI-relevance increases linearly with all those features. Looking at the specific β parameter values in Table 3 for VTT5, we can discern a rule that states: “a document with a few ABNER protein mentions, many NLPROT protein and CHEBI chemical compound mentions in the abstract, a few ABNER protein mentions in figure captions, and many PSI-MI method mentions in the full-text tends to be PPI-relevant”. The exact rule is of course defined by the VTT surface equation, but its linear nature allows us to discern the type of (vague) linguistic “rule-of-thumb” above, which is nonetheless meaningful. It is interesting to notice that the same rule emerges for both SP and bigram features.
The Interaction Method Task
For the IMT, the results shown in Tables 8 and 9 demonstrate that employing the scores, as shown in the three bottom rows of each table, leads to higher precision and lower recall than simply employing pattern matching (the first, All run in both tables). This suggests that the scoring scheme proposed helps to focus attention on sentences that are likely to contain PPI detection methods, although the resulting performance as measured by BioCreative is still low.
Table 11. The distribution of the secondary labels for sentences tagged as Y by the majority of annotators. [Table body not recovered; surviving column headers: # of sentences tagged by the Majority as Label; % with respect to all Y-tagged sentences (755); % with respect to all sentences (1049).]
Notably, there is some discrepancy between the BioCreative evaluation and the values assigned to the results by our group of human annotators. According to the BC3 formal evaluation, as shown in Table 9, the precision of the third run (RScore ≥6) is about 26%. In contrast, as shown by Table 10, annotators who are also familiar with PPI detection methods and who read the sentences, deemed about 70% of the evidence for MIs produced by our system as discussing methods that are applicable to PPI detection. Moreover, as Table 11 shows, the annotators viewed about 35% (counts for Y1 and Y2 combined) of the sentences produced by our system to contain evidence that the methods were indeed applied toward the detection of PPI. In more than half of those (Y2, 19% of the total) the interacting proteins could be detected by the annotators, while in the remaining (Y1, 16% of the total) the interacting proteins were implicit rather than explicitly stated – but interaction detection through the application of the indicated method was still discussed. The above variability highlights the complexity and the possible ambiguity involved in the definition, the interpretation, and the evaluation of the IMT task.
A closer examination of individual sentences further demonstrates these differences in interpretation and evaluation of the task. Below are examples of evidence sentences that our system produces for method assignments that the BC3 gold standard judges as false positives, but that appear to discuss PPI along with the method used to detect it. The examples are formatted using the format requested by BC3, showing (in the required order, from left to right) the PubMed identifier of the article, the MI associated with it, the rank in the list (4, 6 and 4 in the three examples below), and the confidence score (the floating point number), followed by the evidence sentence itself:
19224861 MI:0096 4 0.865173475604312 We found that PEDF was pulled down with Ni-NTA beads when the binding reactions included His-tagged LR or His-tagged LR90 (Fig. 2 G).
18806265 MI:0114 6 0.620645021811025 Previous x-ray crystallography analyses suggest that CARD-CARD interactions occur via interaction between the 23 helical face, and the 1 4 helical face (50).
18819921 MI:0663 4 0.79176182685558 Using confocal microscopy, we show that trapping mutants of both PTP1B and the endoplasmic reticulum targeted TCPTP isoform, TC48, colocalize with Met and that activation of Met enables the nuclear-localized isoform of TCPTP, TC45, to exit the nucleus.
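The field order of the three result lines above can be read programmatically with a sketch like the following. This is a minimal Python illustration; the `EvidenceLine` record and `parse_line` function are hypothetical names, and the fields are assumed to be separated by whitespace as they appear here (the delimiter used in the actual submission files is not stated in the text).

```python
from typing import NamedTuple


class EvidenceLine(NamedTuple):
    pmid: str          # PubMed identifier of the article
    mi: str            # PSI-MI method identifier
    rank: int          # rank of the method within the article's list
    confidence: float  # normalized confidence score
    sentence: str      # evidence sentence


def parse_line(line):
    """Split one result line into its five fields; the evidence
    sentence keeps its internal spaces."""
    pmid, mi, rank, confidence, sentence = line.split(maxsplit=4)
    return EvidenceLine(pmid, mi, int(rank), float(confidence), sentence)


record = parse_line(
    "19224861 MI:0096 4 0.865173475604312 We found that PEDF was pulled "
    "down with Ni-NTA beads when the binding reactions included "
    "His-tagged LR or His-tagged LR90 (Fig. 2 G)."
)
```

Limiting the split to the first four separators leaves the remainder of the line intact as the evidence sentence.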
These examples demonstrate the complexity in the task definition and in its evaluation criteria. The first example appears to be a description of experimental results observed by the authors. In contrast, the second of the three example sentences refers to a "Previous" experiment and provides a reference "(50)". Curators whose explicit task is defined as finding only novel experimental evidence may view the sentence as not useful – because the evidence is not new; this is likely to be the reason why this method was not assigned to the document within the BC3 gold standard. However, these same curators can still use this sentence to back-trace the reference and recover the evidence from the original referenced paper (50). Furthermore, curators and scientists that are tasked with identifying all the evidence in support of an interaction, without the requirement for novelty, will still view the sentence as relevant evidence for the interaction. Notably, the BC3 IMT did not require novelty of evidence as part of the task specification. The third sentence primarily discusses the detection of co-localized proteins rather than of a direct interaction; as such it can be viewed by some curators as relevant and by others as irrelevant.
To summarize, while the utility of each specific sentence, as shown in the examples above, may depend on the exact definition of the curation task, automatically identifying and highlighting such sentences can significantly narrow down the amount of text that a curator needs to examine. The above three examples all help to demonstrate the value of our method in identifying evidence sentences that are likely to be useful.
As a last point, we note that the time required for running our pipeline is realistic within a curation effort. For instance, for processing the test set of about 300 full-text documents, the complete processing time was about 28 minutes (an average processing rate of over 10 documents per minute), of which about 12 minutes were consumed by the classification of each sentence along the various dimensions (Focus, Evidence etc.) by the multi-dimensional classifiers. Most of the steps, including the classification of the sentences, can be readily performed off-line and parallelized to process multiple sentences simultaneously. Thus, the ideas presented here can be readily incorporated into an effective and useful curation pipeline.
We thank the annotators from Sharon Regan's lab and the department of Biology at Queen's University: Kyle Bender, Daniel Frank, Kyle Laursen, Brendan O'Leary and Hernan Del Vecchio. Their work, as well as that of Andrew Wong, was supported by HS's NSERC Discovery and Discovery Accelerator awards #298292-08 and CFI New Opportunities Award 10437. Michael Conover and Azadeh Nematzadeh were supported with a grant from the FLAD Computational Biology Collaboratorium at the Instituto Gulbenkian de Ciencia in Oeiras, Portugal. We are also grateful for the support these grants provided for travel, hosting, and facilities used to conduct part of this research. We thank Artemy Kolchinsky for assistance in setting up the online server for the ACT.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 8, 2011: The Third BioCreative – Critical Assessment of Information Extraction in Biology Challenge. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S8.
- Krallinger M, Valencia A: BioCreative III, PPI Task. 2010. [http://www.biocreative.org/tasks/biocreative-iii/ppi/]
- Settles B: ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics 2005, 21: 3191–3192. 10.1093/bioinformatics/bti475
- Abi-Haidar A, Kaur J, Maguitman A, Radivojac P, Retchsteiner A, Verspoor K, Wang Z, Rocha LM: Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks. Genome Biology 2008.
- Kolchinsky A, Abi-Haidar A, Kaur J, Hamed AA, Rocha LM: Classification of protein-protein interaction full-text documents using text and citation network features. IEEE/ACM Trans Comput Biol Bioinform 2010, 7: 400–411.
- Leitner F, Mardis SA, Krallinger M, Cesareni G, Hirschman LA, Valencia A: An Overview of BioCreative II.5. IEEE/ACM Trans Comput Biol Bioinform 2010, 7: 385–399.
- Wang X, Rafal Rak R, Restificar A, Nobata C, Rupp C, Batista-Navarro R, Nawaz R, Ananiadou S: Detecting Experimental Techniques and Selecting Relevant Documents for Protein-Protein Interactions from Biomedical Literature. BMC Bioinformatics 2011, 12(BioCreative Supplement):S6.
- Rinaldi F, Schneider G, Clematide S, Romacker M, Vachon T: Detection of Interaction Articles and Experimental Methods in Biomedical Literature. BMC Bioinformatics 2011, 12(BioCreative Supplement):S9.
- Krallinger M, Vasquez M, Leitner F, Salgado D, Chatraryamontri A, Winter A, Perfetto L, Briganti L, Licata L, Iannuccelli M, Castagnoli L, Cesareni G, Tyers M, Schneider G, Rinaldi F, Leaman R, Gonzalez G, Matos S, Kim S, Wilbur WJ, Rocha LM, Shatkay H, Tendulkar AV, Agarwal S, Liu F, Wang X, Rak R, Noto K, Elkan C, Lu Z, Islamaj Dogan R, Fontaine J, Andrade-Navarro MA, Valencia A: The Protein-Protein Interaction Tasks of BioCreative III: Classification/Ranking of Articles and Linking Bio-Ontology Concepts to Full Text. BMC Bioinformatics 2011, 12(BioCreative Supplement):S15.
- Porter MF: An algorithm for suffix stripping. Program 1980, 14: 130–137. 10.1108/eb046814
- Settles B: Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets. International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA) 2004, 104–107.
- Mika S, Rost B: NLProt: extracting protein names and sequences from papers. Nucleic Acids Res 2004, 32: W634-W637. 10.1093/nar/gkh427
- Batchelor C, Corbett P: Semantic enrichment of journal articles using chemical named entity recognition. In the 45th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2007:45–48.
- Corbett P, Batchelor C, Teufel S: Annotation of chemical named entities. BioNLP 2007: Biological, translational, and clinical language processing 2007, 57–64.
- Barthelmes J, Ebeling C, Chang A, Schomburg I, Schomburg D: BRENDA, AMENDA and FRENDA: the enzyme information system in 2007. Nucleic Acids Res 2007, 35: D511-D514. 10.1093/nar/gkl972
- Schomburg I, Chang A, Schomburg D: BRENDA, enzyme data and metabolic information. Nucleic Acids Research 2002, 30: 47–49. 10.1093/nar/30.1.47
- Degtyarenko K, De Matos P, Ennis M, Hastings J, Zbinden M, McNaught A, Alcantara R, Darsow M, Guedj M, Ashburner M: ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Research 2008, 36: D344-D350.
- Chatr-aryamontri A, Kerrien S, Khadake J, Orchard S, Ceol A, Licata L, Castagnoli L, Costa S, Derow C, Huntley R, Aranda B, Leroy C, Thorneycroft D, Apweiler R, Cesareni G, Hermjakob H: MINT and IntAct contribute to the Second BioCreative challenge: serving the text-mining community with high quality molecular interaction data. Genome Biol 2008, 9(Suppl 2):S5. 10.1186/gb-2008-9-s2-s5
- Breitling R, Armengaud P, Amtmann A, Herzyk P: Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. Febs Letters 2004, 573: 83–92. 10.1016/j.febslet.2004.07.055
- Kim S, Wilbur WJ: Classifying protein-protein interaction articles using word and syntactic features. BMC Bioinformatics 2011, 12(BioCreative Supplement):S16.
- Shatkay H, Pan FX, Rzhetsky A, Wilbur WJ: Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users. Bioinformatics 2008, 24: 2086–2093. 10.1093/bioinformatics/btn381
- Wilbur WJ, Rzhetsky A, Shatkay H: New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics 2006, 7: 356. 10.1186/1471-2105-7-356
- Chang C, Lin C: LIBSVM: A Library for Support Vector Machines. 2001. [http://www.csie.ntu.edu.tw/~cjlin/libsvm]
- Yona S: CPAN module Lingua-EN-Sentence. 2010. [http://search.cpan.org/~shlomoy/Lingua-EN-Sentence0.25/lib/Lingua/EN/Sentence.pm]
- HUPO Proteomics Standards Initiatives (PSI), Molecular Interaction (MI). 2010. [http://psidev.sourceforge.net/mi/rel25/data/psi-mi25.obo]
- Antezana E: CPAN module ONTO-PERL. 2010. [http://search.cpan.org/~easr/ONTO-PERL-1.23/]
- Franz B: CPAN module Lingua-Stem. 2010. [http://search.cpan.org/~snowhare/Lingua-Stem-0.84/]
- Kubina J: CPAN module Text-Ngramize. 2010. [http://search.cpan.org/~kubina/Text-Ngramize-1.03/lib/Text/Ngramize.pm]
- Simões A: CPAN module Text-RewriteRules. 2010. [http://search.cpan.org/~ambs/Text-RewriteRules-0.23/lib/Text/RewriteRules.pm]
- Humphrey M: CPAN module Lingua::StopWords. 2010. [http://search.cpan.org/dist/Lingua-StopWords/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.