Active learning for human protein-protein interaction prediction
© Mohamed et al; licensee BioMed Central Ltd. 2010
Published: 18 January 2010
Biological processes in cells are carried out by means of protein-protein interactions. Determining whether a pair of proteins interacts by wet-lab experiments is resource-intensive; only about 38,000 interactions, out of a few hundred thousand expected interactions, are known today. Active machine learning can guide the selection of pairs of proteins for future experimental characterization in order to accelerate accurate prediction of the human protein interactome.
Random forest (RF) has previously been shown to be effective for predicting protein-protein interactions. Here, four different active learning algorithms were devised to select protein pairs for training the RF. With labels for as few as 500 protein pairs selected using any of the four active learning methods described here, the classifier achieved a higher F-score (the harmonic mean of precision and recall) than with 3000 randomly chosen protein pairs. The F-score of predicted interactions increases by about 15% with active learning compared to random selection of data.
Active learning algorithms enable learning more accurate classifiers with far less labelled data and prove useful in applications where manual annotation of data is prohibitively expensive. The active learning techniques demonstrated here can also be applied to other proteomics problems such as protein structure prediction and classification.
Protein-protein interactions are central to all biological processes and structural scaffolds in living organisms. A protein is characterized by its 3-dimensional structure, and a biological process in which it takes part, for instance sensing light and transmitting that signal to the brain, is characterized by a pathway of interacting proteins. Protein-protein interactions (PPIs) play a key role in the functioning of cells, enabling signalling and metabolic pathways and forming structural scaffolds in organisms. It has been suggested that an interaction network of human proteins can be used to understand disease mechanisms and would thereby be useful in drug discovery. Several high-throughput methods, such as Yeast 2-Hybrid (Y2H) and mass spectrometry, help determine protein interactions. However, these methods suffer from high false-positive rates, and many protein interactions supported by one method are not supported by another. For instance, around 70% of the interactions identified through Y2H in yeast are estimated to be false positives, and only around 3% of the protein interactions reported in yeast are supported by more than one high-throughput method. In complex organisms such as human, applying high-throughput methods to test every possible protein pair (on the order of 10^8 pairs) would be very expensive in cost and effort. It is estimated that there are between 150,000 and 600,000 distinct protein-protein interactions in human; however, only ~38,000 interactions (6%-25% of the total) are known or suspected today, as per the Human Protein Reference Database. Computational methods are therefore necessary to complete the interactome expeditiously.
Building on several decades of painstaking study of individual proteins by biologists and on advances in high-throughput technologies, it is now possible to attempt prediction of protein-protein interactions from indirect features, and machine-learning-based computational models for protein interaction prediction have recently begun to emerge. Bayesian classifiers [5, 6], Random Forests [6, 7], Logistic Regression [6, 7], Support Vector Machines and Decision Trees have all been applied to protein-protein interaction (PPI) prediction. These methods use the available evidence of known interacting proteins (to label the training data) together with indirect information such as Gene Ontology annotation, gene expression correlation and sequence homology (to build features for protein pairs). Qi et al. and Lin et al. have both shown that Random Forest performs best among the various classifiers they evaluated. Qi et al. suggest that the randomization and ensemble strategy applied in Random Forest enables it to handle noise better.
Active machine learning
Experimentally verified protein interactions are costly and difficult to obtain; therefore, strategies that minimize the amount of labelled data required for supervised learning are valuable. Active learning is a form of supervised learning in which the system selects the data points whose labels would be most informative for the learning task - i.e. it selects which protein-protein interactions to validate or refute in the laboratory. Instead of learning from a large pool of labelled data, the algorithm starts with a pool of unlabelled data and requests labels for a select few data points. An oracle (i.e. a lab experiment) returns the labels (i.e. interactions) for these data points, and the algorithm uses them to update the classification function.
Supervised learning seeks the predictor

f* = arg min_{f ∈ F} Σ_{(x_i, y_i) ∈ D} L(f(x_i), y_i)

where x_i are the values of the features for instance i, y_i is the real label (or score), f(·) is the predicted label (or score), L is the loss function, D is the training data, and F is the set of possible predictor functions (e.g. decision forests). For non-numeric labels the 0-1 loss is typical. Active learning builds on this criterion by attempting to select the next instance x_{i+1} from the universe of possible instances such that, if its true label y_{i+1} were known, the estimate of the best f* could be maximally improved.
Clustering is a common pre-processing step for selecting representative data points; clustering techniques applied for active learning include the K-means and K-medoids [8, 18] algorithms. Uncertainty strategies select data points closest to the decision boundary of the classifier; for example, the data points closest to the decision hyperplane of an SVM classifier are selected for labelling. Roy and McCallum apply active learning with a Naive Bayes classifier, selecting the samples (data points) whose labelling would offer the maximum reduction in expected error. Lewis and Gale train a Naive Bayes classifier in combination with logistic regression on an initial set of labelled samples; in every iteration, the unlabelled samples with maximum uncertainty in class assignment under the current classifier are selected for labelling. DeBarr and Wechsler perform uncertainty sampling with a Random Forest classifier for spam detection: samples assigned a close-to-0.5 probability of being spam by the current Random Forest are selected for labelling in the next iteration. Davy and Luz perform history-based uncertainty sampling with a committee of recently trained classifiers; in each iteration, the samples with maximum disagreement among the classifiers in the committee are selected. This method was shown to perform better for text categorization than active learning that uses only the classifier built from the current labelled set for uncertainty sampling. Several approaches combine density-based and uncertainty-based sampling to improve performance [23, 24]; these methods select samples that are close to the decision boundary and are also good cluster representatives, thereby sampling high-density regions.
Datasets and feature descriptors
We use the dataset created and made available by Qi et al. to evaluate the active learning algorithms developed here. At the time the data was compiled, 14,600 pairs of proteins were known to interact; these are referred to as positive pairs. A set of 400,000 pairs not overlapping with the positive pairs was generated randomly. These pairs, referred to as random pairs, are considered non-interacting, as the probability that a randomly generated pair interacts is less than 1 in 1000 [5, 26]. Of the interactions discovered since, only 27 are found among the 400,000 randomly generated pairs.
Prediction of PPIs is set up as a binary classification task: each feature vector corresponds to a pair of proteins and is classified as interacting or non-interacting. The feature vectors were computed by Qi et al. for both the interacting pairs and the random pairs. The vectors have 27 dimensions, with features corresponding to Gene Ontology (GO) cell component (1), GO molecular function (1), GO biological process (1), co-occurrence in tissue (1), gene expression (16), sequence similarity (1), homology (5) and domain interaction (1), where the numbers in brackets give the number of elements contributed by each feature type. The GO features measure the similarity of two genes based on the similarity between the terms they share in the Gene Ontology database; three GO features were generated, one each for biological process, molecular function and cell component. The 16 gene expression features were computed as the correlation coefficients of the protein pair over sixteen gene expression datasets in the NCBI Gene Expression Omnibus database. The 'tissue' feature is a binary feature indicating whether the two proteins are expressed in the same tissue. The sequence similarity feature is the BlastP sequence alignment E-value for the protein pair. The 'domain interaction' feature measures the interaction probability of a protein pair based on the interaction probabilities of the domains present in the two proteins. The 'homology PPI' features indicate whether proteins homologous to the given pair of human proteins interact in other species (such as yeast). The details of Qi et al.'s compilation of these features may be found on their supplementary website.
Further, the positive and negative pairs are combined in a ratio of 20%-80% (rationale is described in Results).
Precision is the fraction of correctly predicted protein interactions among all the pairs the classifier predicts to be interacting. Recall is the fraction of the interacting protein pairs that the classifier correctly identifies as interacting. F-score, the harmonic mean of precision and recall, combines both into a single measure of accuracy and is therefore used to compare the methods.
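As an illustration, precision, recall and F-score can be computed from the raw prediction counts as follows (a minimal sketch; the function name is ours):

```python
# Precision, recall, and F-score for binary predictions
# (1 = interacting pair, 0 = non-interacting pair).
def precision_recall_f(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```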
Random forest classifier
A random forest (RF) is an ensemble of decision trees, each built with random feature subsets; a majority vote of the trees gives the label of each test point. During the construction of a decision tree, to split each node a subset of n out of the total N features is selected at random, and the feature with maximum information gain among the n is used to split the node. In this work, a random forest with 20 decision trees is constructed; to split each node, a subset of 7 features is selected from the total of 27, and of these the feature offering maximum information gain is used. The RandomTree implementation of the Weka package was used to create the decision trees in the random forest. The minimum number of samples in each leaf node was set to 10.
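The forest configuration described above can be sketched as follows, using scikit-learn's RandomForestClassifier as a stand-in for the Weka RandomTree ensemble actually used (the synthetic data and labels below are placeholders for illustration, not the paper's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# 20 trees, 7 of 27 features tried per split, at least 10 samples per leaf.
rf = RandomForestClassifier(
    n_estimators=20,
    max_features=7,
    min_samples_leaf=10,
    random_state=0,
)

# Placeholder data: 27-dimensional feature vectors for 200 protein pairs.
X = np.random.RandomState(0).rand(200, 27)
y = (X[:, 0] > 0.5).astype(int)   # placeholder labels
rf.fit(X, y)

# Fraction of trees voting "interacting" per pair (p1 in the notation below).
votes = np.mean([tree.predict(X) for tree in rf.estimators_], axis=0)
```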
Active learning data selection strategies
To test the active learning component, all data is treated as unlabelled, and the active learning method asks for labels iteratively, based on the distribution of instances (protein pairs) and the learned decision function that is refined at each iteration. This process is repeated until the maximum number of labels, usually called the "labelling budget", is reached. In all the data selection strategies described below, labels are requested for 250 points in each iteration, and 12 iterations are performed, for a total of 3000 acquired labels. In other active learning experiments the number of iterations often equals the number of label requests; we batch requests to reduce the cost of classifier retraining.
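The labelling-budget loop can be sketched as follows; `select`, `train` and `oracle` are placeholders for the strategy-specific pieces described in the subsections below (the first call to `select` may ignore the model, as in the random or density-based seeds):

```python
# Skeleton of the labelling-budget loop: 12 iterations of 250 label
# requests each, for 3000 labels total.
def active_learning_loop(pool, oracle, select, train,
                         batch_size=250, n_iterations=12):
    labelled = {}                         # index -> acquired label
    model = None
    for _ in range(n_iterations):
        batch = select(model, pool, labelled, batch_size)
        for idx in batch:
            labelled[idx] = oracle(idx)   # e.g., a wet-lab experiment
        model = train(labelled)           # retrain the random forest
    return model, labelled
```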
A. Baseline - random data selection
A random forest was constructed for three training datasets that differ in the ratio of positive pairs they contain: 1%, 20% and 45% respectively. The size of the training data is incremented from 250 to 3000 pairs in steps of 250 pairs. The 250 pairs in each iteration are selected randomly from the overall 10,000 data points assembled for training. A random forest is retrained in each iteration, and its performance on the test data is evaluated.
B. Density based
In each cluster C_i, the s_i unlabelled data points closest to the centroid are selected and their labels are requested, where s_i is cluster C_i's share of the 250 labels requested per iteration. The Weka package was used for the K-means clustering.
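A sketch of this density-based selection step, assuming (as seems standard) that each cluster's quota s_i is proportional to its size, and using scikit-learn's KMeans in place of the Weka implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def density_select(X_pool, n_clusters=50, budget=250, seed=0):
    """Cluster the unlabelled pool, then pick the s_i points nearest
    each centroid, with s_i proportional to cluster size (assumption)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_pool)
    chosen = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        quota = max(1, round(budget * len(members) / len(X_pool)))
        # Distance of each member to its cluster centroid.
        dists = np.linalg.norm(X_pool[members] - km.cluster_centers_[c], axis=1)
        chosen.extend(members[np.argsort(dists)[:quota]])
    return chosen[:budget]
```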
C. Uncertainty based (random seed)
Confusion is measured as the entropy of the forest's vote split:

Confusion(x) = -p0 log2 p0 - p1 log2 p1

where p0 is the fraction of the decision trees in the random forest that label the protein pair as non-interacting, and p1 is the fraction that label it as interacting.
In each iteration, 250 data points with the maximum confusion are selected and their labels are obtained. These are added to the existing set of labelled data and a new random forest is trained from this data. This new random forest is used in the next iteration for selecting the maximal-confusion points.
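The per-iteration selection can be sketched as follows, taking confusion to be the entropy of the forest's vote fractions (the function names are ours):

```python
import numpy as np

def confusion(p1):
    """Entropy of the tree-vote split; maximal when p0 = p1 = 0.5."""
    p1 = np.clip(p1, 1e-12, 1 - 1e-12)   # guard against log(0)
    p0 = 1.0 - p1
    return -(p0 * np.log2(p0) + p1 * np.log2(p1))

def select_most_confused(vote_fractions, batch=250):
    """vote_fractions[i] = fraction of trees labelling pair i interacting.
    Returns the indices of the `batch` most confused pairs."""
    scores = confusion(np.asarray(vote_fractions, dtype=float))
    return np.argsort(-scores)[:batch]
```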
D. Uncertainty based (density-based seed)
This method is the same as the previous one, except that in the first iteration the data is selected by density (by performing K-means clustering as described in the 'density based' method) rather than randomly.
E. Uncertainty based with history
This method is based on the technique proposed by Davy and Luz, in which entropy (confusion) is measured as the disagreement among the past 'm' predictions for a sample. We consider the past 3 predictions to measure confusion. The computation is carried out as follows:

Pi0(x) = probability that the protein-pair 'x' is non-interacting according to the ith classifier,
Pi1(x) = probability that the protein-pair 'x' is interacting according to the ith classifier, where i ∈ [1, m].

These are averaged over the past 'm' classifiers:

PA0(x) = (1/m) Σ_{i=1}^{m} Pi0(x), the average probability that 'x' is non-interacting,
PA1(x) = (1/m) Σ_{i=1}^{m} Pi1(x), the average probability that 'x' is interacting,

and the confusion is the entropy of the averaged distribution:

Confusion(x) = -PA0(x) log2 PA0(x) - PA1(x) log2 PA1(x)
This method requires that the first 'm' classifiers be built by some other mechanism; subsequent iterations select data points using the confusion metric described above. Since 'uncertainty based - density-based seed' performed best among other methods (see Results section), the first 'm' classifiers were built using this technique.
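A sketch of the history-based confusion score, interpreted as the entropy of the probabilities averaged over the past m classifiers (one plausible reading of the description above; the function name is ours):

```python
import math

def history_confusion(past_p1):
    """past_p1: the 'interacting' probabilities P_i1(x) assigned to a
    pair x by the last m classifiers. Returns the entropy of the
    averaged distribution (P_A0, P_A1)."""
    m = len(past_p1)
    pa1 = sum(past_p1) / m    # P_A1(x): average P(interacting)
    pa0 = 1.0 - pa1           # P_A0(x): average P(non-interacting)
    eps = 1e-12               # guard against log(0)
    return -(pa0 * math.log2(pa0 + eps) + pa1 * math.log2(pa1 + eps))
```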
Results and discussion
Coverage of the feature space of protein-pairs
A small experiment was carried out to estimate whether feature coverage differs significantly between the two classes. The elements of the feature vectors were replaced with 1's and 0's corresponding to "feature-present" and "feature-absent" respectively; for example, if the Gene Ontology localization value is known, that feature is set to 1, irrespective of what the localization is. We call this new vector the 'coverage vector'. It too has 27 dimensions, one for each element of the original feature vector. A random forest trained on these binary coverage vectors achieved a precision of 60%, recall of 56% and F-score of 58%, whereas on the original feature vectors it achieved a precision of 90%, recall of 13% and F-score of 23%. Thus the coverage vectors were better than the actual feature vectors at classifying protein-pairs as positive or random.
The reason for this may be that a protein pair that is experimentally verified to be interacting is sufficiently important that it would also most likely have been characterized by several experiments, thereby contributing to several feature values being 'present' in the protein-pair vector.
In order to estimate the true capability of the learning algorithm to predict interactions without an indirect bias introduced due to feature-coverage, a subset of the dataset with every point having at least 80% feature coverage is created and used for the experiments in this work. In other words, all the feature vectors in this new dataset contain at least 22 out of the 27 features.
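The coverage-vector construction and the 80%-coverage filter can be sketched as follows (missing values are assumed here to be encoded as None, which is our convention for illustration, not necessarily the dataset's):

```python
def coverage_vector(features):
    """Replace each feature value with 1 if present, 0 if missing."""
    return [0 if v is None else 1 for v in features]

def high_coverage(dataset, min_fraction=0.8):
    """Keep only feature vectors with at least min_fraction coverage
    (>= 22 of 27 features for the dataset used in this work)."""
    keep = []
    for features in dataset:
        cov = coverage_vector(features)
        if sum(cov) >= min_fraction * len(cov):
            keep.append(features)
    return keep
```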
The training dataset containing 20% positive pairs was selected for evaluating all the active learning algorithms, because in nature interacting pairs are far outnumbered by non-interacting pairs. However, when the percentage of positive pairs in the training data (a few thousand pairs) is very low (say 1%), recall is extremely poor; this is another open challenge in the domain, which is not addressed in this work. To evaluate the capability of active learning against a non-active learning method, we therefore chose the training dataset with 20% positive pairs.
Each method was initialized with 250 labelled protein pairs. In K-means clustering, the number of clusters K was set to 50, a value chosen by trial and error. Each algorithm requests labels for 250 data points per iteration. With the updated labelled data, a random forest is trained and its performance is evaluated on the test data in each iteration.
Each of the algorithms is executed 5 times and the results are averaged. This is done because in two of the methods, the initial data is chosen randomly and hence performance could vary based on the initial data selected. Further, building the random forest involves selecting a random subset of features at every node in each decision tree, and there could be performance variation between each build of the random forests. Computing an average over multiple runs provides more reliable measures for comparing the performance.
The five algorithms described above were evaluated on the training and test data described above. The precision, recall and F-score for the various methods were computed.
The 'density based' method achieves its maximum F-score at around 500 data points. It reaches a recall of around 47% at 500 labelled samples, with no significant improvement thereafter. This is likely because the 250-500 data points selected from the centres of the clusters suffice to represent the data distribution; further samples do not seem to provide additional information. This method, however, gives a lower precision than 'random' and the other active learning methods. On analysis we find that the clustering of the data is imperfect: in the training dataset the clusters have 83.4% purity on average, while the clusters dominated by interacting pairs have only 77.5% purity on average. Further, since most of the clusters are dominated by non-interacting pairs (owing to their higher proportion in the training data), 64.55% of the interacting pairs actually lie in clusters dominated by non-interacting pairs. These issues limit the maximum performance obtainable with a purely clustering-based approach.
In the 'uncertainty based method with random seed', recall almost doubles in the first active learning iteration (i.e. from 250 to 500 data points) (Figure 5), which raises the F-score from around 0.3 to above 0.5. In the first iteration, however, there is a drop in precision. As described earlier, this is because the uncertainty-based method tends to select a large number of interacting pairs: 65% of the data points selected by this active learner in the first iteration (the first 250 points it selects) are interacting pairs, a much higher proportion than in the data set. In the following iterations precision increases gradually, reaching 78.5% at 3000 data points (Figure 4).
The 'uncertainty based method with density based seed' gives a higher F-score than the uncertainty-based method with a random seed. Selecting the seed by density rather than randomly increases recall as expected (Figure 5), as it better represents the underlying data distribution, thereby leading to a better F-score.
The 'uncertainty based method with history' performs best in terms of F-score and recall. A history of the past 3 predictions (m = 3) for each data point is taken into account. Unlike the other active learning methods, whose F-score does not improve after the first few iterations, the 'uncertainty based method with history' shows a consistent increase in F-score, achieving 60% F-score at 3000 labelled data points, with a recall of 51% and precision of 73%.
Four different active learning algorithms were evaluated for the protein-protein interaction prediction task. The results show that active learning enables better learning with less labelled training data. The density-based method improved recall by selecting data representative of the unlabelled set. Applying a density-based seed improves performance over a random seed in the confusion-based techniques. It is interesting that measuring disagreement among past predictions ('uncertainty based with history') performs better than measuring only the current classifier's confusion about a sample's label ('uncertainty based' with random or density-based seed). The entropy-based methods seek labels for a large number of interacting pairs in each iteration (Figure 7), even though interacting pairs form a small proportion of the overall unlabelled set. This enables faster learning of the rules and characteristics defining positive interactions, showing the suitability of these methods for the protein interaction prediction problem, where the ratio of interacting to non-interacting pairs is very low.
Many human protein-protein interactions remain undiscovered. Understanding the human protein interactome can play a major role in the study of diseases and in drug discovery. The active learning methods described here achieve higher accuracy by choosing the most informative protein pairs for labelling. The algorithms can be applied to select candidate protein pairs whose interaction status, if determined experimentally, can aid in accurately predicting many other interactions computationally. This can reduce the cost and effort of building the human protein interactome by substantially reducing the number of new in-vitro experiments required to determine specific protein-protein interaction pairs.
Authors would like to thank Geet Garg for programming assistance in computing the coverage values of feature vectors. MKG's work has been partially funded by Department of Defense Henry M. Jackson Foundation grant for Gynecological Diseases Program W81XWH-05-2-0005.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 1, 2010: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S1.
- Alberts B: Molecular biology of the cell. 3rd edition. New York: Garland Pub; 1994.
- Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, et al.: Towards a proteome-scale map of the human protein-protein interaction network. Nature 2005, 437(7062):1173–1178. 10.1038/nature04209
- Deane CM, Salwinski L, Xenarios I, Eisenberg D: Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics 2002, 1(5):349–356. 10.1074/mcp.M100037-MCP200
- von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417(6887):399–403. 10.1038/nature750
- Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M: A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 2003, 302(5644):449–453. 10.1126/science.1087361
- Qi Y, Bar-Joseph Z, Klein-Seetharaman J: Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins 2006, 63(3):490–500. 10.1002/prot.20865
- Lin N, Wu B, Jansen R, Gerstein M, Zhao H: Information assessment on predicting protein-protein interactions. BMC Bioinformatics 2004, 5:154. 10.1186/1471-2105-5-154
- DeBarr D, Wechsler H: Spam Detection using Clustering, Random Forests, and Active Learning. Sixth Conference on Email and Anti-Spam, Mountain View, California; 2009.
- McCallum A, Nigam K: Employing EM and Pool-based Active Learning for Text Classification. International Conference on Machine Learning (ICML) 1998, 359–367.
- Nguyen H, Smeulders A: Active Learning using Pre-clustering. International Conference on Machine Learning (ICML) 2004, 623–630.
- Campbell C, Cristianini N, Smola A: Query Learning with Large Margin Classifiers. International Conference on Machine Learning (ICML) 2000, 111–118.
- Tong S, Koller D: Support vector machine active learning with applications to text classification. Proceedings of the International Conference on Machine Learning 2000, 999–1006.
- Xu Z, Yu K, Tresp V, Xu X, Wang J: Representative Sampling for Text Classification Using Support Vector Machines. Advances in Information Retrieval: 25th European Conference on IR Research (ECIR 2003), Italy; 2003.
- Baram Y, El-Yaniv R, Luz K: Online Choice of Active Learning Algorithms. International Conference on Machine Learning (ICML) 2003, 19–26.
- Donmez P, Carbonell J, Bennett P: Dual-Strategy Active Learning. European Conference on Machine Learning (ECML) 2007, Warsaw, Poland.
- Melville P, Mooney R: Diverse Ensembles for Active Learning. International Conference on Machine Learning (ICML) 2004, 584–591.
- Tang M, Luo X, Roukos S: Active learning for statistical natural language parsing. ACL 2002, Philadelphia, PA, USA.
- Shen X, Zhai C: Active Feedback in Ad Hoc Information Retrieval. 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'05) 2005, 59–66.
- Campbell C, Cristianini N, Smola A: Query Learning with Large Margin Classifiers. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML-2000). Morgan Kaufmann; 2000.
- Roy N, McCallum A: Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the 18th International Conference on Machine Learning. Morgan Kaufmann; 2001.
- Lewis D, Gale WA: A sequential algorithm for training text classifiers. Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval 1994, 3–12.
- Davy M, Luz S: Active learning with history-based query selection for text categorisation. In ECIR 2007. Volume 4425. Springer; 2007:695–698.
- Nguyen HT, Smeulders A: Active learning using pre-clustering. Proceedings of the twenty-first international conference on Machine learning, Banff, Alberta, Canada; 2004, 79.
- Donmez P, Carbonell JG, Bennett PN: Dual Strategy Active Learning. Proceedings of the 18th European Conference on Machine Learning, Warsaw, Poland; 2007.
- Qi Y, Klein-Seetharaman J, Bar-Joseph Z: A mixture of feature experts approach for protein-protein interaction prediction. BMC Bioinformatics 2007, 8(Suppl 10):S6. 10.1186/1471-2105-8-S10-S6
- Tong AH, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M, et al.: Global mapping of the yeast genetic interaction network. Science 2004, 303(5659):808–813. 10.1126/science.1091317
- Witten IH, Frank E: Data mining: practical machine learning tools and techniques. 2nd edition. Amsterdam; Boston, MA: Morgan Kaufmann; 2005.
- Qi Y, Klein-Seetharaman J, Bar-Joseph Z: Random forest similarity for protein-protein interaction prediction from multiple sources. Pac Symp Biocomput 2005, 531–542.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.