Protein-protein interaction extraction with feature selection by evaluating contribution levels of groups consisting of related features

Background Protein-protein interaction (PPI) extraction from published scientific articles is one key issue in biological research due to its importance in grasping biological processes. Despite considerable advances of recent research in automatic PPI extraction from articles, demand remains to enhance the performance of the existing methods. Results Our feature-based method incorporates the strength of many kinds of diverse features, such as lexical and word context features derived from sentences, syntactic features derived from parse trees, and features using existing patterns to extract PPIs automatically from articles. Among these abundant features, we assemble the related features into four groups and define the contribution level (CL) for each group, which consists of related features. Our method consists of two steps. First, we divide the training set into subsets based on the structure of the sentence and the existence of significant keywords (SKs) and apply the sentence patterns given in advance to each subset. Second, we automatically perform feature selection based on the CL values of the four groups that consist of related features and the k-nearest neighbor algorithm (k-NN) through three approaches: (1) focusing on the group with the best contribution level (BEST1G); (2) unoptimized combination of three groups with the best contribution levels (U3G); (3) optimized combination of two groups with the best contribution levels (O2G). Conclusions Our method outperforms other state-of-the-art PPI extraction systems in terms of F-score on the HPRD50 corpus and achieves promising results that are comparable with these PPI extraction systems on other corpora. Further, our method always obtains the best F-score on all the corpora than when using k-NN only without exploiting the CLs of the groups of related features.

yields high recall but low precision. Conversely, the second kind of method, pattern-or rule-based, which often utilizes handcrafted patterns or rules, achieves high precision but low recall. The third kind of method, machine learning-based, can be divided into two types: featurebased and kernel-based.
Feature-based methods represent an instance, which is a protein pair that consists of two protein names in a sentence, by many features, e.g., lexical, word context, and syntactic features derived from the sentence or its syntactic structure. Besides lexical and syntactic features, Liu et al. [1] exploited various features from dependency information, including predicate features involved in dependency graphs. Landeghem et al. [2] proposed rich feature vectors containing semantic information from dependency graphs and lexical information from sentences. They also first applied automatic feature selection techniques and showed that these techniques can enhance the generalization performance in PPI extraction and make the models faster and more cost-effective.
Recently, various kernel-based methods have been proposed that provide kernel functions that measure the similarity between any pair of instances represented by structural representations, such as constituent parse trees or dependency graphs. These kernel functions differ from each other in their type of input representation and how they compute similarity functions. Airola et al. [3] proposed a graph kernel-based method in which a sentence is represented by a combination of a dependency graph and another graph showing the linear order of words and considered all the possible paths connecting two entities in the dependency graph. Miwa et al. [4] proposed a method that combines kernels based on several syntactic parsers to retrieve the widest possible range of important information from a given sentence. Their method, which combines a bag-of-words kernel, a subset tree kernel, and a graph kernel, assigns the same weight to each individual kernel. Qian et al. [5] proposed a tree kernelbased method by exploiting both constituent parse trees and dependency graphs and further refined the tree representation from a constituent parse tree by utilizing the shortest dependency path between two proteins in a dependency graph.
Feature-based methods are considered more appropriate to practical applications than kernel-based methods due to the computation complexity of sophisticated kernels [1]. Moreover, feature-based methods can be improved by applying feature selection to enhance the generalization performance and attain faster and more cost-effective models [2]. The latest study of Tikk et al. [6] analyzed and compared the performances of diverse kinds of kernels. They found that different kernels using the same input representation perform similarly on a large number of protein pairs, which are identified as misclassified by most state-of-the-art kernel-based methods. Based on their convincing experimental results, to improve the PPI extraction performance, they argued that we should concentrate on finding and choosing informative features suitably rather than devising novel similarity functions encoded in kernels.
In this paper we propose a novel feature-based method to extract PPIs from articles. We exploit various features, including lexical features and word context features obtained directly from sentences, syntactic features obtained from parse trees, and features using existing patterns. We arrange the related features into four groups. For instance, we assemble two related features, which represent the positions of two protein names that constitute an instance in the sentence containing them, into one group. We also define the contribution level (CL) of each group, which consists of related features. Our method differs from existing methods in two ways. First, based on the structure of the sentence and the presence of significant keywords (SKs), we divide the training set into subsets and apply the sentence patterns provided beforehand to each one. Second, after computing the CL values of the four groups that consist of related features, we automatically implement feature selection based on their CLs and the k-nearest neighbor algorithm (k-NN) through three approaches: (1) focusing on the group with the best contribution level (BEST1G); (2) unoptimized combination of three groups with the best contribution levels (U3G); (3) optimized combination of two groups with the best contribution levels (O2G). To the best of our knowledge, this is the first method that automatically selects the most appropriate features for each group consisting of related features by combining their CLs with k-NN.

Outline of features of protein pairs in sentences used in extraction of protein-protein interaction information
The automatic extraction of PPI information from articles is regarded as a binary classification in which instances (protein pairs) that include PPI and instances that do not include PPI are classified as positive and negative instances. In machine learning approaches, based on labeled training data, extracted features, which are the characteristics of sentences containing protein pairs, are utilized to discriminate positive from negative instances. Then a model is learned from the training data, and each instance is classified as positive or negative.
Our features are categorized into four types: lexical features obtained directly from the sentence, word context features obtained directly from the sentence, syntactic features obtained from the parse tree, and features that use the existing patterns. In the following tables, P1, P2, and K denote the protein name that appears first, the protein name that appears later, and the keyword in a sentence, respectively.
• Lexical features obtained directly from sentences: These features are outlined in Table 1. • Word context features obtained directly from sentences: These features are outlined in Table 2. • Syntactic features obtained from parse trees: All sentences are transformed into representations called constituent parse trees, which can capture the syntactic structures of sentences. The features obtained from the constituent parse tree are outlined in Table 3. An example of a constituent parse tree output from the Stanford parser [7] is shown in Fig. 1. • Features using existing patterns: We prepared thirteen syntax patterns related to the presence or absence of PPI in Table 4 based on the syntax patterns that were proposed by Plake et al. [8]. iNoun and iVerb, which represent the sets of nouns and verbs related to interaction, are improved from the original ones used by Plake et al. [8]. The wildcard '*' represents any word or words in a pattern. The

Words indicating condition or presumption
Check if 'if' or 'whether' appears between keyword and one of the two protein names or between two protein names.

Preposition of keyword
Preposition following keyword providing that the distance between it and the keyword is within 3. If there are many prepositions, the preposition closer to the keyword is utilized.
One of the prepositions In sentence AIMed.d55.s487 of AIMed corpus, "The binding of LEC to CCR8 was much less significant," feature value is 'of'.
Second keyword Only one of seven words, 'bind', 'interact', 'stimulate', 'associate', 'regulate', 'induce', and 'known', is not chosen as a keyword, check if that word appears between two protein names. These seven words can be considered especially significant in PPI classification compared with other keywords. This feature prevents these words from being overlooked as keywords.
'true' or 'false' for each of these seven words (If one of these seven words appears in the sentence and is not chosen as a keyword, feature value for it is 'true').
In sentence IEPA.d0.s0 (Fig. 1  The distance defined by number of words appearing between keyword K and protein name P1 in the sentence.

Integer value
In sentence LLL.d33.s1 of LLL corpus, "GerE binds to a site on one of these promoters, cotX, that overlaps its -35 region," keyword is 'bind' and Distance_KP1 is 0.

Distance_KP2
The distance between keyword K and P2 in the sentence.

Distance_P1P2
The distance between two protein names in the sentence.

Position_P1
The value adding word distance between protein name P1 and beginning of the sentence to one.

Position_P2
The value adding word distance between protein name P2 and beginning of the sentence to one.

Comma between keyword and protein pair
Because topic of the sentence frequently changes before and after commas, we utilize the information if there is a comma between protein pair and keyword. 'x 1 x 2 ': x 1 is 't' if a comma exists between A and B, and x 2 is 't' if a comma exists between B and C, otherwise x 1 or x 2 is 'f', where A, B, and C represent a keyword and two protein names in order of their appearance in the sentence.

Multiple occurrences of keywords
Check whether there is more than one keyword in a sentence.

Parallel expression of a protein pair
Check whether the two protein names of the protein pair are contiguous in the word order of the sentence containing them (they are also considered contiguous even if '-', '/', 'and', 'or', '(' appears between them). If two protein names are described in parallel in a sentence, an interaction between them is unlikely.
Word context features extracted from sentences. P1, P2, and K denote the protein name appearing first, the protein name appearing later, and the keyword in a sentence, respectively. 't' and 'f' are abbreviations of 'true' and 'false' number of words substituted by a wildcard in a pattern is limited to five. If an instance (a protein pair (P1, P2)) matches (or does not match) one of these patterns, the feature value is set as 'true' (or 'false'). Figure 2 shows the framework of our PPI extraction system, which consists of two phases: division of the training set into subsets and PPI prediction based on evaluating the contribution levels (CLs) of the groups consisting of related features. In the first phase, the training data are partitioned into subsets based on the significant keywords described below and the feature position of keyword. In the second phase, the related features are arranged into groups, and cross-validation is performed on the training data to train the k-NN classifier to generate a predictive model, which is utilized to assess the CLs of the groups that consist of related features that indicate the efficiency in the selection of the optimal combination of the features for each group. After the CLs of these groups are estimated, the appropriate features are selected automatically from the following three approaches: focusing on the group with the best CL (BEST1G), unoptimized combination of the three groups with the best CLs (U3G), and optimized combination of the two groups with the best CLs (O2G). Finally, the k-NN classifier is used to classify the candidate PPI pairs of the test data.

Division of the training set into subsets Significant keyword (SK)
Since the feature keyword (Table 1) performs a noticeable role in identifying whether the sentence contains a PPI, it is utilized in the majority of research related to The height of first protein name P1 of instance at constituent parse tree. This height differs from features Distance_KP1, Distance_KP2, and Distance_P1P2, i.e., distances between protein pair and keyword described in Table 2.

Height_P2
The height of second protein name P2 of instance at constituent parse tree. This height differs from features Distance_KP1, Dis-tance_KP2, and Distance_P1P2.

Height_K
The height of keyword K at constituent parse tree. This height differs from features Distance_KP1, Distance_KP2, and Distance_P1P2.

POS_P1
We take into account the part-of-speech information of path from root at constituent parse tree of the two protein names constituting instance and keyword. It is possible to represent syntax structure and train classifiers to learn the pseudo grammar structure by this information. POS_P1 denotes part-of-speech information of path from root of leaf representing first protein P1 of instance.
The list of part-of-speech information of path from root at constituent parse tree.

POS_P2
POS_P2 denotes part-of-speech information of path from root of leaf representing second protein P2 of instance.
The list of part-of-speech information of path from root at constituent parse tree.
POS_K POS_K denotes part-of-speech information of path from root of leaf representing keywork K.
The list of part-of-speech information of path from root at constituent parse tree.
All sentences were transformed into representations called constituent parse trees, output from the Stanford parser [7]. Syntactic features were extracted from constituent parse trees. P1, P2, and K denote the protein name appearing first, the protein name appearing later, and the keyword in a sentence, respectively PPI extraction [9,10]. Nevertheless, emphasizing only this feature can cause an adverse effect in which other features may not receive appropriate attention or might even be ignored. Consequently, we distinguish between cases when the feature keyword contributes significantly to PPI classification and when it does not. We call the keyword in the former case SK.
Through the observation of the imbalance of the classes of instances when classifying them based on the presence or absence of a certain keyword K, we determine whether this feature keyword K is SK by defining imbalance degree ID(K) as follows: Fig. 1 Example of a constituent parse tree. Constituent parse tree for sentence, "Oxytocin stimulates IP3 production in dose-dependent fashion as well," from sentence IEPA.d0.s0 of IEPA corpus (first protein P1 is Oxytocin and second protein P2 is IP3) Pattern 7 iNoun between * P1 * and * P2 Pattern 8 complex between * P1 * and * P2 Pattern 9 complex of * P1 * and * P2

Pattern 13 between P1 and P2
We prepared syntax patterns related to PPI based on the syntax patterns proposed by Plake et al. [8]. P1 and P2 denote the protein names appearing first and later in a sentence, respectively. iNoun and iVerb denote sets of nouns and verbs related to interaction. The number of words substituted by a wildcard ' * ' in a pattern is limited to five. After the training set was divided into subsets based on the existence of significant keywords and the structure of the sentence, these syntax patterns were applied to each subset where N P and N N denote the number of positive and negative instances containing K in the training set, respectively. ID(K) = 1 means that K is completely balanced. On the contrary, ID(K) = 0 (i.e., 1 ID(K) = 0) means that K is completely imbalanced. Thus, when the value of ID(K) or 1 ID(K) is less than predefined threshold T, we regard K as a SK.

Position of keyword
In addition to the importance whether a keyword is SK, since the basic structure of a sentence can be grasped by determining the feature position of keyword that signifies the word order of the keyword and the pair of protein names in that sentence, this feature should also be stressed. We recognize that if the basic structures of the sentences differ, the features that should be emphasized in PPI classification will also differ. If the feature position of keyword's value is 'infix' (i.e., the keyword exists between a pair of protein names in the sentence), the sentence structure has the typical Subject-Verb-Object form. For example, in the sentence, "GerE binds to a site on one of these promoters, cotX, that overlaps its -35 region, " (the keyword 'bind' is the Verb, the protein 'GerE' is the Fig. 2 Framework of our PPI extraction system. Our system consists of two phases. First, training set is divided into subsets based on presence of significant keywords and the feature position of keyword. Second, after cross-validation is performed on the training data to assess the contribution levels of four groups, which consist of related features, feature selection is performed automatically through our three approaches (BEST1G, U3G, O2G). Finally, the k-NN classifier is used to classify candidate PPI pairs of test data Subject, the protein 'cotX' is the Object, and the feature position of keyword's value is 'infix'), since the relation of a protein name and the keyword is that of the Subject (or the Object) and the Verb, some features (e.g., Distance_KP1, Distance_KP2, and Distance_P1P2 that indicate the distances between protein names and the keyword) perform substantial roles. Conversely, if the feature position of keyword's value is 'prefix' or 'postfix' , the sentence structure is regarded as atypical, such as the parallel expression of protein names, inverted structure, and so forth. For example, in the sentence, "association between cdc25A and cdc2 was detected in the HeLa cells, " (the keyword is 'associate' obtained by stemming the word 'association' , the two protein names are 'cdc25A' and 'cdc2' , and the feature position of keyword's value is 'prefix'), since the relation of a protein name and the keyword, which is not that of the Subject (or the Object) and the Verb, is that of inverted structure, some features (e.g, Distance_KP1, Distance_KP2, and Distance_P1P2) are not always emphasized. Therefore, the importance of a feature is likely to be varied depending on the structure of the sentence.

Division of training set
Because the features that should be emphasized can vary depending on whether the feature keyword is a SK and the feature position of keyword between the two protein names as stated above, we divide the training set into three subsets, A, B, and C (Table 5), and generate three classifiers from each subset. Similarly, we divide the unlabeled instances into one of three subsets, A' , B' , and C' , and utilize the corresponding classifier to identify whether PPIs exist in these instances. The outline of this process is illustrated in Fig. 3.
Because the original training set is divided into three subsets, not all the patterns (Table 4) always match each subset. Consequently, we discard the irrelevant and useless patterns beforehand based on the structure of the sentence (Table 6). For subset A, since the feature position of keyword's value is 'infix' (i.e., the sentence structure has the typical Subject-Verb-Object form in which the keyword that corresponds to a verb exists between two protein names that correspond to a subject and an object), we observed that patterns 7, 8, 9, and 13 (Table 4) do not match this Subject-Verb-Object form. Patterns 7, 8, The training set was divided into subsets, A, B, and C, based on presence of the significant keyword and the feature position of keyword and 9 stand for phrases K-P1-P2 (the keyword is K and the protein pair is (P1, P2)). Similarly, pattern 13 is also unsuitable for subset A. For subset B, because the feature position of keyword's value is 'prefix' or 'postfix' (i.e., the sentence structure is atypical, such as the parallel expression of protein names, the inverted structure, and so forth), patterns 1, 2, 10, and 12 do not match these sentence structures. As a result, we remove them beforehand for subsets A and B. For subset C, because the position of keyword's value is 'infix' , 'prefix' , or 'postfix' , we do not eliminate any pattern shown in Table 4.

Prediction based on evaluating contribution levels of groups consisting of related features Groups consisting of related features
The syntactic features derived from the parsers, features showing the distances between two protein names and the keyword, and features showing the positions of two protein names are considered important in PPI classification [4,6]. However, among some related features (e.g., Height_P1, Height_P2, and Height_K mentioned in Table 3, which depict the heights of the two protein names of the instance and the keyword at the constituent parse tree), we cannot affirm intuitively which feature is definitely more important than the others without understanding the data characteristics. Therefore, we arranged the related features into the following four groups to automatically evaluate each one separately in the PPI extraction performance by the CL defined hereinafter: • Group G 1 consists of three related features: Distance_KP1, Distance_KP2, and Distance_P1P2 (Table 2). • Group G 2 consists of two related features: Position_P1 and Position_P2 (Table 2). • Group G 3 consists of three related features: Height_P1, Height_P2, and Height_K (Table 3). • Group G 4 consists of three related features: POS_P1, POS_P2, and POS_K (Table 3).
When some features in a group, which consists of related features, are useless and redundant, using all the features in that group can decrease the accuracy of the learning algorithm. In this case, removing some irrelevant features from that group (i.e., feature selection) can improve the PPI extraction performance. Conversely, when all the features in a group are useful, it is unnecessary to remove any feature from that group. By utilizing the original training data, we can determine automatically whether it is necessary to eliminate any feature in any group individually to enhance the PPI extraction performance.
After individually selecting the optimal feature set for each group, we assess the CLs of these groups in the Fig. 3 Outline of PPI prediction based on division of training set. Training set was divided into subsets, A, B, and C, based on existence of significant keyword and feature position of keyword. Three classifiers were generated from every subset. Similarly, unlabeled instances were divided into one of three subsets, A', B', and C', and corresponding classifier was used to identify whether PPIs exist in these instances process of training the classifier with the original training data. When some groups greatly contribute to the classifier training, we had better put more focus on these groups. In the opposite case, it is important to give ample consideration to other groups.

Cross-validation
Cross-validation (CV), which is a widespread strategy to estimate the model prediction performance, can also be utilized in feature selection to determine which subsets of features are useful in building good predictive models. In this paper we perform S-fold CV (SFCV) on original training data Train all to identify the redundant features in groups G 1 , G 2 , G 3 , and G 4 , which consist of related features, compute the CLs of these groups, and perform feature selection. In SFCV, original training data Train all is divided into S equal-sized partitions P i (i = 0, · · · , S − 1). In each Round i of SFCV (i = 0, . . . , S − 1), a combination of S − 1 partitions is used as training set Train i , defined as Train all − P i (i = 0, · · · , S − 1) (Fig. 4), to train the predictive model that is then validated on the remaining partition, called validation set Validation i (= P i ). We discuss it in detail below.

Contribution level (CL) of a group consisting of related features
To describe our method more simply from now, we also utilize the abbreviation SFCV, the notations of original training data Train all , partitions P i , training set Train i , and validation set Validation i (i = 0, · · · , S − 1), which were defined in the Cross-validation section. Because group G 1 contains three related features (Dis-tance_KP1, Distance_KP2, and Distance_P1P2), there are eight ways (2 3 = 8) to select features from these features to create eight combinations of features for G 1 . We performed SFCV to train the k-NN classifier on original training data Train all . For each training set Train i (i = 0, · · · , S − 1), we trained the k-NN classifier with every combination of the features in G 1 (we do not change or remove the features in other groups). Then for each Train i , we search for the optimal combination of features in G 1 that yields the maximum F-score on validation set Validation i (the definition of F-score is mentioned in the next section), denoted as Fcon 1i (1 signifies the index of G 1 ). Similarly, for each Train i , we also look for the optimal combinations of features in groups G 2 , G 3 , and G 4 separately that yield the maximum F-score, denoted as Fcon 2i , Fcon 3i , and Fcon 4i , on Validation i , respectively.
In this paper we define the CL for each group consisting of related features to improve the PPI extraction accuracy. CL is an indicator for determining the efficiency in the selection of the optimal combination of the features for each group. The pseudo-code for the calculation of the CLs of these four groups is shown in Listing 1. C j denotes the CL of each group G j (j = 1, 2, 3, 4). Function max returns the maximum value among its arguments. For all values of i from 0 to S − 1, if the maximum among the values of Fcon ji (j = 1, 2, 3, 4) is one of group G t , i.e., Fcon ti (t ∈ {1, 2, 3, 4}), we increase CL C t of G t .
From the CL values of the four groups output from function EvalCon4, if groups G x , G y , and G z still exist (x, y, z ∈ {1, 2, 3, 4}), which have the same CL values, we only recompute the CLs for G x , G y , and G z by function EvalCon3 (Listing 2) without recomputing the CL for the remaining group. After computing the CLs CL x , CL y , and CL z of G x , G y , and G z (lines 2-6), which resembles the calculation of the CLs of function EvalCon4, we resolve the exception (lines 8-24) in which the CLs of G x , G y , and G z become equal again. In this exception, we compute A x , A y , and A z (lines 10-12). A j (j ∈ {1, 2, 3, 4}) of group G j denotes the maximum among Fcon j0 , Fcon j1 , . . ., and Fcon j(S−1) . We regard the CLs of G x , G y , and G z as identical as A x , A y , and A z , respectively. If all three groups, G x , G y , and G z , still have the same CLs (lines 17-22), we regard the CLs of G x , G y , and G z as identical as A x , A y , and A z , respectively. In lines 18-20, function second_max returns the second maximum among its arguments.
From the CLs of the four groups output from function EvalCon4, if only two groups remain with identical CLs, we only recompute the CLs for these groups by function EvalCon2 without recomputing the CLs for the remaining groups. Because function EvalCon2 resembles function EvalCon3, we do not present its code here.  The ultimate goal of automatic PPI extraction from articles is extracting PPIs from any new unseen text with high predictive accuracy. In addition to PPI extraction accuracy, reducing the computational time for training and testing a PPI extraction system is also crucial. Tikk et al. [11] compared the performances of diverse kinds of kernels on a large-scale database called Medline, which contains nearly 120-M sentences. They reported that when the top three kernels (all path graph kernel, shallow linguistic kernel, and k-band shortest path spectrum kernel) that show the best performance were applied to Medline on a single processor, their runtimes were about 45, 141, and 4 days, respectively. Including the time to parse sentences, their runtimes changed to 226, 147, and 185 days, respectively. In other words, to extract PPI on Medline, we need about half a year. Consequently, lowering the computational time becomes critical for PPI extraction tasks. Landeghem et al. [2] also argued that we have to consider a trade-off between PPI extraction accuracy and computational time to decrease the amount of computational resources utilized by machine learning. Based on the CL values of four groups, G 1 , G 2 , G 3 , and G 4 , we employ the following three approaches: focusing on the group with the best contribution level (BEST1G), unoptimized combination of three groups with the best contribution levels (U3G), and optimized combination of two groups with the best contribution levels (O2G) to automatically perform feature selection to enhance PPI extraction accuracy. For the above reason, we also take into account reducing the computational time by feature selection. We briefly describe the advantages of our three approaches below.
BEST1G: We only performed feature selection automatically on the group with the best CL among the four groups. Although we also aim at improving PPI extraction accuracy, our concern with this approach is mainly about limiting the computational time by feature selection as much as possible. Therefore, compared with U3G and O2G, the advantage of BEST1G is the least computational time due to feature selection. However, since BEST1G is not guaranteed to always attain better extraction accuracy than U3G and O2G, it is suitable for a large-sized dataset in which effectively decreasing the computational time is always the most required factor. U3G: We merged all the features in the three groups with the best CLs among the four groups. Then we automatically performed feature selection on these merged features from these three groups by a greedy algorithm. Although this greedy strategy generally does not produce an optimal subset of these features, it may yield a locally optimal subset of them. We want to exploit the information from the features in these three groups to improve the extraction accuracy without wasting time performing feature selection by exhaustively searching for all their combinations. Hence, compared with BEST1G and O2G, the advantage of U3G is maintaining a trade-off between extraction accuracy and computational time due to feature selection. However, U3G is also not guaranteed to always achieve better extraction accuracy than BEST1G and O2G. As a result, U3G is suitable for a mediumsized dataset in which balancing extraction accuracy and computational time is necessary.
O2G: We merged all the features in the two groups with the best CLs among the four groups. Then we automatically performed feature selection on these merged features from these two groups by exhaustively searching for all the combinations of these features. Consequently, we assume that O2G performs better with respect to extraction accuracy than BEST1G and U3G. The disadvantage of O2G is higher computational times due to feature selection than BEST1G and U3G when applying O2G, BEST1G, and U3G to a medium-sized dataset or a large-sized dataset.
As a result, O2G is suitable for a small-sized dataset in which effectively increasing the extraction accuracy is always the most required factor.
We describe the implementation details of BEST1G, U3G, and O2G below.
BEST1G First, we selected group G j having the best CL among the four groups. For all the values of i from 0 to S − 1, we then searched for the maximum of Fcon ji and its corresponding combination of features in G j when applying SFCV and k-NN to training set Train i .
U3G First, we selected three groups having the best CLs among the four groups and merged all the features in them. Assume that the order of CLs C a , C b , and C c of the three corresponding selected groups G a , G b , and G c is C a > C b > C c . For each training set Train i (i = 0, · · · , S − 1), we performed k-NN by gradually removing the features in these three groups in the order of their CLs (i.e., we eliminated the features in G a first, then those in G b , and finally those in G c ), and if the F-score, denoted as F i , on validation set Validation i improved even slightly, we conclude immediately that this improved value of F i , denoted as Fu i , is the unoptimized value for Train i .
Next, for all the values of i from 0 to S − 1, we searched for the maximum of Fu i and its corresponding combination of features in these three groups.
O2G First, we selected two groups having the best CLs among the four groups and merged all the features in them. For each training set Train i (i = 0, · · · , S − 1), we performed k-NN with every combination of the features in these two merged groups. Then, for each training set Train i , we searched for the optimal combination of the features in these two merged groups that yields the maximum F-score, denoted as Fo i , on validation set Validation i .
Next, for all the values of i from 0 to S − 1, we searched for the maximum of Fo i and its corresponding combination of features in these two merged groups.

Datasets
We utilized all the datasets from four typical PPIannotated corpora: LLL [12], HPRD50 [13], IEPA [14], and AIMed [15] (except BioInfer [16]). Table 7 shows the number of positive and negative instances in them. In addition to the 200 PubMed abstracts in AIMed that were manually annotated for interactions between human genes and proteins, 30 other abstracts without PPIs were added to AIMed as negative instances. HPRD50 is comprised of 50 abstracts, in which the human gene and protein names were automatically identified by ProMiner software. IEPA was created from 303 PubMed abstracts, each Table 7 Number of positive and negative instances in four  corpora: LLL, HPRD50, IEPA, and AIMed   Corpus  LLL  HPRD50  IEPA  AIMed   Positive instances  164  163  335  1000   Negative instances  166  270  482  4834 Four corpora, LLL, HPRD50, IEPA, and AIMed, were converted into a unified XML format with a very simple structure by Pyysalo et al. [17] to make the corpora easily accessible to users. Number of positive instances (interacting protein pairs) and negative instances (non-interacting protein pairs) in each corpus is shown of which contains a specific pair of co-occurring chemicals. The LLL corpus contains 77 sentences and was the shared dataset for the Learning Language in Logic 2005 challenge (LLL05). The LLL domain is the gene interactions of Bacillus subtilis. These corpora carry information about named entities from biological domains and annotated PPIs. Nevertheless, they differ from one another. For example, with respect to the scope of the annotated entities, most of them typically contain proteins and genes, some contain RNAs, but IEPA contains only chemicals. The policies of entity annotation and interaction annotation among these copora are also slightly different [17]. Pyysalo et al. [17] reported that although the entity annotation of the types relevant to the corpus is exhaustive only in AIMed and BioInfer, entity annotation is merely based on lists of entity names or the named entity recognizer output in other corpora. They also indicated that the differences in interaction annotation are even greater than those in entity annotation, e.g., only BioInfer and IEPA contain information identifying the words that state an interaction, and all but HPRD50 specify the direction of the interactions. They converted these corpora into a unified XML format, which we utilized in our study, with a very simple structure to make the corpora easily accessible to users. In the unified XML format, each corpus is comprised of documents that are abstracts of articles, each document is comprised of sentences, each sentence might contain some proteins, and the names and relations of proteins are annotated by some attributes of this XML format. We regard the PPI extraction task as a binary classification in which interacting protein pairs are considered positive instances and vice versa. If a sentence contains n proteins, n 2 instances (i.e., protein pairs) are generated. In this paper, the two protein names of a candidate PPI instance and the other proteins in the same sentence are renamed as P1, P2, and P0 to blind the learner to allow direct comparison to earlier studies. By utilizing the information contained in the attributes of the unified XML format of these corpora, we parsed all the sentences in them to extract the features.
BioInfer has an especially extensive annotation policy that combines three types of annotations: termed entity, entity relationship, and dependency. Some sentences of this corpus were annotated with protein names that do not contain contiguous tokens. For instance, in the phrase, "When liver-or islet-type glucokinase was transiently expressed in COS-7 cells" from sentence BioInfer.d43.s3, two protein names are annotated, "liver-type glucokinase" and "islet-type glucokinase, " and "liver-type glucokinase" is annotated as a protein reference despite its discontinuous name. In this paper, the two protein names of a candidate PPI instance and the other proteins in the same sentence are assumed to contain only contiguous tokens. Although we currently cannot tackle these non-contiguous protein names of the BioInfer corpus, which is why we do not analyze the dataset of this corpus, we intend to deal with this problem in the future.

Evaluation methods
To allow for direct comparison to earlier studies, we evaluated the performance of our method by 10-fold document-level cross-validation (10FDLCV). In each round of 10FDLCV, one of the ten partitions containing 10 % of the documents is used as a test set, and the combination of the remaining nine partitions containing 90 % of the documents is used as a training set. SFCV, which is described in the Method section, is performed as 9FDLCV (S is set to 9). We also adopted the One Answer per Occurrence strategy in which a correct interaction has to be extracted for each occurrence of the instance. The threshold T's value (the Method section) is set to 0.18.
For classifying PPI, we use the k-NN algorithm mentioned above by normalizing the value of each feature before computing the Euclidean distance between two feature vectors. The values of the features are normalized so that they all lie between 0 and 1 and the features on different scales have the same impact on the distance function. If two categorical features are identical (or different), we consider the difference between them as 0 (or 1).
For small-sized corpora, LLL, HPRD50, and IEPA, based on the rule of thumb in machine learning that chooses parameter k of k-NN to be near the square root of the size of the training set, we chose k to be odd and equal to 19, 21, and 29 (in binary classification, we choose an odd k of k-NN to avoid ties). For a larger AIMed corpus containing the most highly imbalanced PPI data, if the rule of thumb is applied, k is too big, about 71-73, and the computation cost (i.e., the computation of the distances among instances) also increases greatly with this value of k. Therefore, for the AIMed corpus, we performed crossvalidation on the training data with a range of value of k, and chose k that equals 11 based on the lowest root mean square error.
Precision (P), recall (R), and harmonic value F-score (F) are used as evaluation measures, which are defined as follows: TP, FP, and FN denote the number of true positives, false positives, and false negatives, respectively. Precision is the percentage of correct predictions from all the instances predicted as positive. Recall is the percentage of correctly predicted positive instances from all positive instances. Table 8 shows the results of our three approaches (BEST1G, U3G, O2G). As a baseline, we also added the results when only k-NN was applied, and feature selection using the CLs of the groups consisting of related features was not performed. Our three approaches considerably improved all the F-score results on all the corpora compared with the case that only uses k-NN without performing feature selection that utilizes the CLs of these groups.

Experiment results
Compared with BEST1G, although U3G attains a better result on IEPA or an equivalent result on HPRD50, BEST1G outperforms U3G on LLL and AIMed in terms of F-score. BEST1G learned all the possible combinations Precision (P), recall (R), F-score (F) results of our three approaches (BEST1G, U3G, O2G) evaluated by 10-fold document-level cross-validation on four corpora, LLL, HPRD50, IEPA, and AIMed, shown in the second, third, and fourth row. As a baseline, in the first row, we add results when only k-NN is applied, and feature selection using contribution levels of groups consisting of related features was not performed. Precision (P), recall (R), and F-score (F) values are shown by percentage (%). Bold typeface shows best results per corpus in terms of precision, recall, and F-score of the features in the group with the best CL, whereas there are cases where U3G can immediately halt the feature selection completely when only a feature in the group with the best CL is removed and the F-score on the validation set is only slightly improved. Therefore, the BEST1G results are generally better than those of U3G. Conversely, since O2G exhaustively searches for all the combinations of the features in the two merged groups having the best CLs, O2G exceeds BEST1G on HPRD50 and IEPA and greatly surpasses U3G on LLL and HPRD50. Generally, despite the highest computational time due to feature selection when applied to the medium-sized IEPA corpus or the large-sized AIMed corpus, O2G performs the best among our three approaches, as we assumed in the Method section.
However, depending on the characteristics of the corpus, the best performance belongs to BEST1G, U3G, and O2G on AIMed, IEPA, and HPRD50, respectively, in terms of F-score. Both BEST1G and O2G performed best on LLL. O2G failed to attain the best F-scores on IEPA and AIMed, as we assumed in the Method section, because the heterogeneity among corpora can lead to heterogeneous evaluation results. For example, unlike other corpora, 30 abstracts without PPIs were intentionally added to AIMed as negative instances. As a result, the percentage of sentences that have no entities is 18 % in AIMed, but it is 0 % in the other corpora. The percentage of sentences that have no interactions is 69 % in AIMed, but it is less than or equal to 38 % in the other corpora.
The ratio between the number of positive instances and all the instances in the highly imbalanced AIMed is too low, about only 17.1 % compared with the other corpora. Additionally, unlike other corpora, the scope of the annotated entities in IEPA is chemicals. Therefore, due to the enormous differences among the corpora, in a few cases, O2G may not be the best approach, even though overall it is superior to BEST1G and U3G with respect to extraction accuracy.
Moreover, the F-score results of O2G on IEPA and AIMed still surpass those when only k-NN is applied. As described in the Method section, with a large-sized corpus like AIMed in which effectively decreasing the computational time is always the most required factor, we should apply BEST1G because it has the lowest computational time. With a medium-sized corpus like IEPA in which we need a trade-off between extraction accuracy and computational time, we should utilize U3G. With small-sized copora like LLL and HPRD50 in which effectively increasing extraction accuracy is always the most necessary factor, O2G remains the best choice.

Comparison with other related research
The performance comparison of our three approaches (BEST1G, U3G, O2G) with other related research is shown in Table 9. The results of the co-occurrence and rule-based methods are also listed in Table 9 as a baseline. Fundel et al., who proposed the RelEx system [13], applied a small number of simple rules to dependency  [18], utilized support vector machines with tree kernels to extract rules for PPI extraction by a combination of deep syntactic parser Enju and a shallow dependency parser. Our approaches (BEST1G, U3G, O2G) are mostly far better than these co-occurrence and rule-based methods on all the corpora in terms of F-score. Compared with the feature-based methods, BEST1G, U3G, and O2G outperformed their performances on LLL, HPRD50, and IEPA (except compared with F-score of the method by Liu et al. [1] on LLL).
Compared with the kernel-based methods, BEST1G, U3G, and O2G achieved F-score results on par with two of these methods (Airola et al. [3], Tikk et al. [11]) on LLL and with three of these methods (Miwa et al. Each of the methods has its own disadvantages and advantages. Therefore, no method is almighty and powerful for all the corpora. For each kernel-method, there exists a corpus for which that particular method is the best compared with other methods (e.g., the methods proposed by Airola et al., Miwa et al., and Qian et al. outperformed other methods in terms of F-score on IEPA, AIMed, and LLL, respectively), and there exist corpora for which that particular method is not suitable (i.e., the F-score results of that particular method on these corpora are not good). In the method by Tikk et al., there is no corpus showing remarkably high result of F-score compared with our approaches and other methods.
Especially in case of the HPRD50 corpus, our approaches greatly surpassed the feature-based methods as well as the kernel-based ones. In terms of F-scores on HPRD50, O2G exceeds the best results of the featurebased methods and the kernel-based ones (i.e., 64.9 % and 71.0 %) by 7.7 % and 1.6 %, respectively. In that respect, our approaches have a distinct unique value compared with other related research.
Conversely, in case of the AIMed corpus, although our approaches performed fairly better than the Tikk et al. and Yakushiji et al. [19] methods or on par with the method by Landeghem et al. [2], our approaches are not better than other feature-based and kernel-based methods in terms of F-scores. However, a direct comparison among different systems on AIMed is not straightforward due to the difference in data preprocessing, remarkably different interpretations related to the number of interacting or non-interacting pairs in the AIMed corpus [3], learning methods, parameter tuning, and different evaluation methods [11]. For instance, Liu et al. [1] identified two more interacting and 40 fewer non-interacting protein pairs than our study in AIMed. Similarly, 164 fewer noninteracting protein pairs were reported in the work of Landeghem et al. Moreover, since the self-interactions (59 instances) of the AIMed corpus are not regarded as PPI candidates and were removed from the corpus prior to evaluation in the methods proposed by Liu et al., Airola et al., Miwa et al., and Qian et al., the F-score results of these systems would be higher than our method in which we did not discard self-interactions in the preprocessing step. These differences could boost the performance of these studies. Further, we did not utilize dependency information, like the methods proposed by Liu

Conclusion
In this paper we propose a novel method for automatic PPI extraction from research articles. Our method automatically implements feature selection based on evaluating the CLs of the groups that consist of related features to enhance the extraction performance. Our three approaches (BEST1G, U3G, O2G) attain comparable performance in all the corpora and achieve better results in the HPRD50 corpus than the other previous research. In addition, our three approaches always gain better F-scores in all the corpora than when only using k-NN without utilizing the CLs of the groups that consist of related features.
In realistic PPI datasets, there are far fewer interacting protein pairs than non-interacting ones. The imbalance of the PPI data can compromise the process of learning. Dealing with the imbalance of PPI data is highly challenging [20]. For future work, we plan to design a more efficient method to resolve this problem. Moreover, we intend to design an ensemble that is comprised of dissimilar kernels, devise more effective representations of instances, and explore novel useful features to improve the PPI extraction performance. We also plan to create a predictive model for PPI extraction based on deep learning techniques by combining with kernels to extract PPIs from such large text collections as Medline.