Choosing negative examples for the prediction of protein-protein interactions

The protein-protein interaction networks of even well-studied model organisms are sketchy at best, highlighting the continued need for computational methods to help direct experimentalists in the search for novel interactions. This need has prompted the development of a number of methods for predicting protein-protein interactions based on various sources of data and methodologies. The common method for choosing negative examples for training a predictor of protein-protein interactions is based on annotations of cellular localization and on the observation that pairs of proteins that have different localization patterns are unlikely to interact. While this method leads to high-quality sets of non-interacting proteins, we find that this choice can lead to biased estimates of prediction accuracy, because the constraints placed on the distribution of the negative examples make the task easier. The effects of this bias are demonstrated in the context of both sequence-based and non-sequence-based features used for predicting protein-protein interactions.


Background
Despite advances in high-throughput experimental methods for detecting protein-protein interactions, the interaction networks for even well-studied model organisms are far from complete. In addition, high-throughput assays typically have a high rate of false positives [1]. Therefore, there is a continuing need for computational methods to complement existing experimental approaches.
Methods for predicting protein-protein interactions use a variety of data sources. Sequence-based methods are usually based on the domain, motif, or k-mer composition of the sequences. Sprinzak and Margalit [2] have noted that many pairs of structural domains tend to appear in interacting proteins, and have used this intuition to predict interactions according to the over-representation of pairs of domains. Domain and motif composition is also the basis of several Bayesian network models that aim to explain an observed interaction network in terms of interactions between pairs of motifs or domains [3][4][5]. In the context of kernel methods, similar kernels designed for predicting interactions from sequence were proposed in [6,7]. Other sequence-based methods use co-evolution of interacting proteins by comparing phylogenetic trees [8], correlated mutations [9], or gene fusion [10]. An alternative approach is to combine multiple sources of genomic information (gene expression, Gene Ontology annotations, transcriptional regulation, etc.) to predict co-membership in a complex [11][12][13].
All the above-mentioned methods require an informed choice of positive examples (interacting pairs of proteins) and negative examples (non-interacting pairs of proteins) for training and assessing the performance of a classifier. In view of the large fraction of false positive interactions generated by high-throughput methods, positive examples need to be chosen with care. These are often chosen as interactions generated by reliable methods (small-scale experiments), interactions confirmed by several methods, or interactions confirmed by interacting paralogs [1,11,14,15].
(From the NIPS workshop on New Problems and Methods in Computational Biology, Whistler, Canada, 18 December 2004.)
Negative examples also need to be chosen with care, and two such selection methods have been described in the literature. Because there are no "gold standard" non-interactions, some authors suggest that high-quality non-interactions can be generated by considering pairs of proteins whose cellular localization is different, most likely preventing the proteins from participating in a biologically relevant interaction [11,16]. Other authors use a simpler scheme, selecting non-interacting pairs uniformly at random from the set of all protein pairs that are not known to interact [4,7,12,17].
In this paper, we argue that the first method is not appropriate for assessing classifier accuracy. In particular, we show that restricting negative examples to non-co-localized protein pairs leads to a biased estimate of the accuracy of a predictor of protein-protein interactions. The basic assumption underlying the assessment of the accuracy of a classifier is that the distribution of testing examples reflects the intended use of the method. In the case of predicting protein-protein interactions, a simple uniform random choice of non-interacting protein pairs yields an unbiased estimate of the true distribution. In contrast, imposing the constraint of non-co-localization may induce a different distribution on the features that are used for classification. The resulting biased distribution of negative examples leads to over-optimistic estimates of classifier accuracy. This bias is likely to affect results reported in several papers [5,6,11].
The simpler selection scheme, choosing negative examples uniformly at random, also has potential pitfalls: because the interaction network is not complete, the set of negative examples can be contaminated with interacting proteins. This contamination, however, is likely to be very small: it has been estimated that the number of interactions in yeast is well below 100,000 [1,18], a number which is 0.25 percent of the total number of protein pairs in yeast. This effect is likely to be much smaller than the contamination of even high-quality positive examples; moreover, our results show that a support vector machine classifier is resistant to even higher levels of label contamination.

Results
In this paper we postulate that testing a classifier of protein-protein interactions on negative examples composed of pairs of proteins that are not co-localized results in a biased assessment of classifier accuracy. In order to test this hypothesis we need to define "co-localization." We do this using the subcellular localization component of the Gene Ontology (GO). GO keywords are becoming the standard in annotating gene products [20]. These keywords are arranged in a hierarchical manner in a rooted, directed acyclic graph, where keywords lower in the hierarchy represent more specific terms. Therefore, one cannot say that two proteins are not co-localized simply because they don't share the exact same GO terms. As a similarity measure between two GO terms we use the negative log of the fraction of genes annotated with the lowest common ancestor of the two terms. This similarity was introduced as a similarity measure on a hierarchy in [21], used in the context of GO annotations in [22], and used in a kernel in [7]. Using this measure of similarity allows us to generate parameterized sets of negative examples characterized by a maximum degree of similarity allowed between their GO cellular compartment annotations.
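The similarity measure described above can be sketched on a toy hierarchy. The term names, the parent map, and the gene annotations below are hypothetical and stand in for the GO cellular compartment graph; the sketch also assumes that, when two terms have several common ancestors in the DAG, the most informative one (the one annotating the fewest genes) plays the role of the lowest common ancestor.

```python
import math

# Hypothetical toy hierarchy: term -> list of parent terms (a DAG rooted at "cell").
PARENTS = {
    "cell": [],
    "nucleus": ["cell"],
    "cytoplasm": ["cell"],
    "nucleolus": ["nucleus"],
    "mitochondrion": ["cytoplasm"],
}

# Hypothetical direct annotations: gene -> set of terms.
ANNOTATIONS = {
    "g1": {"nucleolus"},
    "g2": {"nucleus"},
    "g3": {"mitochondrion"},
    "g4": {"cytoplasm"},
}


def ancestors(term):
    """Return the term together with all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_fraction(term):
    """Fraction of genes annotated with the term or with any of its descendants."""
    hits = sum(
        1
        for terms in ANNOTATIONS.values()
        if any(term in ancestors(t) for t in terms)
    )
    return hits / len(ANNOTATIONS)


def go_similarity(term_a, term_b):
    """Negative log of the gene fraction at the most specific common ancestor."""
    common = ancestors(term_a) & ancestors(term_b)
    # Assumes every common ancestor annotates at least one gene (true for this toy data).
    return max(-math.log(term_fraction(t)) for t in common)
```

In this toy example, two terms whose only common ancestor is the root get similarity 0, since every gene is annotated under the root, while terms that share a more specific ancestor get a larger value.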
Perhaps the simplest way to predict protein-protein interactions is to represent pairs of proteins by a set of genomic features that reflect how likely they are to interact. Examples of features that were used for this task are similarity of GO process and GO function annotations, correlation of gene expression, presence of similar transcription factor binding sites in the upstream region of the genes, participation in common regulatory modules, and so on [11][12][13]. Table 1 illustrates that as we vary the upper bound on the allowed similarity between the cellular compartment annotations of pairs of proteins in the negative examples (called the co-localization threshold in what follows), GO function and process annotations and microarray data become more predictive of protein-protein interactions, as measured using the ROC score (the area under the receiver operating characteristic curve). This observation is not surprising. Consider, for example, biological process annotations. Interacting proteins often participate in similar processes. Conversely, negative examples that are not co-localized will be less likely to participate in similar biological processes, making this variable more predictive of interaction. A similar argument holds for the GO function annotations and gene expression correlations. Note that GO function annotations are less predictive than the process annotations because interactions are often required for carrying out a particular process, whereas proteins that carry out the same function can do so in different contexts, not requiring interaction.
Using non-co-localized negative examples can lead to a bias when using sequence-based features as well. In this case the features are pairs of sequence features, e.g., motifs or k-mers that belong to a pair of protein sequences. Such a kernel was used in [6,7] with a support vector machine (SVM) classifier. The dimensionality of the feature space of these kernels is very high, and in fact, the method doesn't use an explicit representation of the features. For the sequence-based features we demonstrate the bias incurred by using non-co-localized negative examples by showing that the accuracy of a classifier depends on the co-localization threshold of the negative examples on which the method was tested. Figure 1 illustrates the increase in classifier accuracy as the co-localization threshold is decreased. This effect is much larger than the variability that results from the randomness in the choice of negative examples and the cross-validation (CV) estimate: the standard deviation of the ROC score on 10 drawings of the negative examples was 0.003, and the variability between different runs of CV is even lower. We can explain the higher accuracy for low co-localization threshold by the fact that the constraint on localization restricts the negative examples to a sub-space of sequence space, making the learning problem easier than when there is no constraint.
The spectrum kernel method uses pairs of k-mers as features; the motif method uses the composition of discrete sequence motifs; and the non-sequence method uses features such as co-expression as measured in microarray experiments and similarity in GO process and function annotations. We performed our experiments on two yeast physical interaction datasets. The BIND data is derived from the BIND database; the experiments using the non-sequence data were performed on a subset of reliable interactions that are found by multiple assays in BIND. DIP/MIPS is a dataset of reliable interactions derived from the DIP and MIPS databases.

In our experiments we used sets of negative examples characterized by the similarity of the localization annotations of two proteins. To see the relevance of our results to other published work, we need to establish a relationship between our co-localization threshold and the criteria used elsewhere. The data of [11] is used in several studies of protein-protein interactions. The authors of [11] considered five very broad cellular compartments (cytoplasm, mitochondrion, nucleus, plasma membrane, and secretory pathway organelles). Four of these have corresponding nodes in the cellular compartment part of GO. The GO similarity between these compartments ranges from 0.002 to 0.36, and is 0.13 on average. At this level of the co-localization threshold our results show a strong effect.

Discussion
There are many pitfalls in designing machine learning experiments (see [23] for an example in the context of feature selection). Design of experiments in the field of bioinformatics, where various sources of data are often correlated, requires special care to make sure no information on the testing example labels leaks to the representation of the training examples. In this paper, we illustrated a phenomenon where, by constraining the distribution of negative examples, the classification problem becomes easier. Although choosing negative examples as pairs of proteins that are localized to different cellular compartments creates high-quality negative examples, it also makes them easier to distinguish from interacting proteins. In the case where the data is characterized by features such as similarity of GO process or function annotations, constraining the distribution of the cellular compartment similarity has a direct effect on the distribution of the GO process annotation similarity.
In the case of the sequence-based classifiers, the improvement in classifier performance is the result of constraining the negative examples to a smaller region of sequence space. We see a difference between the behavior of the motif/pfam kernels and the spectrum kernel: the results with the spectrum kernel are more strongly affected by the distribution of negative examples. We believe that this difference is the result of the greater flexibility of the spectrum kernel, which allows it to fit arbitrary training sets. The motif/pfam kernels, by contrast, use features that are more biologically relevant, and so cannot be biased as much as the spectrum kernel. The gold standard negative examples of [11] were not only constrained by lack of co-localization; they also required that both proteins in a pair have GO annotations in both the function and process components. This constraint would likely increase classifier accuracy even further.
The reader may suspect that the improvement in classifier accuracy when constraining the negative examples to be non-co-localized may be the result of higher quality negative examples. To address this concern we performed the following experiment to test the effect of changing the labels of a small fraction
Without being aware of the bias in using gold standard non-interactions, one may think, looking at a couple of papers that describe methods for predicting protein-protein interactions from sequence [5,6], that the problem is well addressed by these methods. However, this is not the case: the good performance is in fact a result of the biased selection of negative examples, and prediction of protein-protein interactions from sequence is a difficult problem that can still be considered unsolved.

Positive Examples
We focus on prediction of physical interactions in yeast and use interaction data derived from several sources. These interactions are used as positive examples when training our classifiers.
• Data from BIND [24]. BIND includes published interaction data from high-throughput experiments as well as curated entries derived from published papers. Using physical interactions yields a dataset comprising 10,517 interactions among 4,233 yeast proteins (downloaded July 9th, 2004). Selecting interactions that were verified by multiple experimental assays yields a dataset of 750 trusted interactions. We used all the interactions for training, but assessed the performance only on trusted interactions.
• A curated set of high-quality interactions from MIPS and DIP [25,26], also used in [5]. This set contains MIPS interactions that were annotated as physical interactions derived from small-scale experiments, DIP interactions from small-scale experiments, and DIP interactions verified by multiple experiments, for a total of 4,838 interactions.
In both cases we avoided using interactions that were validated by interacting paralogs in yeast to define trusted interactions, since those are likely to be easier to predict using the sequence-based methods. We eliminated self-interactions from each dataset, since many of the features we use are based on measures of similarity between the two proteins, e.g., gene expression correlation and similarity of GO annotations.

Negative Examples
We compared two methods for choosing negative examples in this paper:
• Random pairs of proteins that are not known to physically interact.
• Parameterized sets of negative examples, chosen as random pairs of proteins that are not known to physically interact, such that the similarity of their GO cellular compartment annotations is below some threshold.
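Both selection schemes can be sketched with one rejection-sampling routine. The function name, its parameters, and the `similarity` callback are hypothetical and only illustrate the idea; this is not the code used for the experiments reported here.

```python
import random


def sample_negatives(proteins, known_interactions, n, similarity=None,
                     threshold=None, seed=0):
    """Draw up to n distinct protein pairs that are not known to interact.

    Pairs are proposed uniformly at random and rejected if they are known
    to interact. If `similarity` and `threshold` are given, pairs whose
    localization similarity is not below the threshold are rejected too.
    """
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in known_interactions}
    negatives = set()
    tries = 0
    max_tries = 1000 * n  # arbitrary cap so the loop ends if too few pairs qualify
    while len(negatives) < n and tries < max_tries:
        tries += 1
        a, b = rng.sample(proteins, 2)  # two distinct proteins, so no self-pairs
        pair = frozenset((a, b))
        if pair in known or pair in negatives:
            continue
        if threshold is not None and similarity(a, b) >= threshold:
            continue
        negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)
```

Calling the function without `similarity` and `threshold` corresponds to the first scheme; passing a GO cellular compartment similarity and a co-localization threshold corresponds to the second.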

Support Vector Machines
The support vector machine (SVM) [27] is a classification method that provides state-of-the-art performance in many domains, including bioinformatics [28,29]. SVMs access the data only through the kernel function, which defines the similarity between data objects. This allows the use of SVMs even when an explicit vector-space representation of the data is not available, as long as a kernel function is provided. This is the case for one of the kernels used in this work, where a kernel between two pairs of sequences is defined (see below and [6,7]).

Figures of merit
In this paper we evaluate the accuracy of a trained classifier using two metrics: the area under the receiver operating characteristic curve (ROC score), and the normalized area under that curve up to the first 50 false positives (ROC50 score). Both metrics aim to measure both sensitivity and specificity by integrating over a curve that plots the true positive rate as a function of the false positive rate. The motivation for using both metrics is provided, for example, in [7].
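The two figures of merit can be computed from a ranked list of predictions. The following is a minimal sketch; the function name `roc_n` is hypothetical, labels are assumed to be 1 for interacting pairs and 0 for non-interacting pairs, and ties in score are simply broken by input order.

```python
def roc_n(scores, labels, n=None):
    """Normalized area under the ROC curve up to the first n false positives.

    With n=None the whole curve is used, which gives the ordinary ROC score;
    n=50 gives the ROC50 score.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    limit = n_neg if n is None else min(n, n_neg)
    if n_pos == 0 or limit == 0:
        raise ValueError("need at least one positive and one negative example")
    true_pos = false_pos = area = 0
    for _, label in ranked:
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # positives ranked above this false positive
            if false_pos == limit:
                break
    # Dividing by limit * n_pos scales a perfect ranking to 1.0.
    return area / (limit * n_pos)
```

For example, `roc_n(scores, labels)` returns the ROC score and `roc_n(scores, labels, n=50)` the ROC50 score for the same predictions.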

Pairwise kernels
The kernels proposed in the literature for handling genomic information, e.g., sequence kernels such as the motif and spectrum kernels presented below, provide a similarity between two sequences, or more generally, a similarity between representations of two proteins. Therefore, such kernels are not directly applicable to the task of predicting protein-protein interactions, which requires a similarity between two pairs of proteins. Thus, we want a function $K((X_1, X_2), (X'_1, X'_2))$ that returns the similarity between the pair of proteins $(X_1, X_2)$ and the pair $(X'_1, X'_2)$. We call a kernel that operates on individual genes or proteins a genomic kernel, and a kernel that compares pairs of genes or proteins a pairwise kernel. Two recent papers proposed an approach for converting a genomic kernel into a pairwise kernel [6,7]. They define the kernel

$$K((X_1, X_2), (X'_1, X'_2)) = K'(X_1, X'_1)\,K'(X_2, X'_2) + K'(X_1, X'_2)\,K'(X_2, X'_1),$$

where $K'(\cdot, \cdot)$ is any genomic kernel. The intuition behind the kernel is that for the pair $(X_1, X_2)$ to be considered similar to $(X'_1, X'_2)$, either $X_1$ needs to be similar to $X'_1$ and $X_2$ needs to be similar to $X'_2$ (the first term), or $X_1$ is similar to $X'_2$ and $X_2$ is similar to $X'_1$ (the second term). The feature space for this kernel is a vector space of (symmetrized) pairs of features from the underlying genomic kernel.
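The construction of a pairwise kernel from a genomic kernel can be sketched in a few lines. The function names are hypothetical, and a plain dot product on small feature vectors stands in for the genomic kernel; any function of two proteins could be passed instead.

```python
def dot(u, v):
    """Toy genomic kernel: dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def pairwise_kernel(genomic_kernel, pair_a, pair_b):
    """Similarity between two protein pairs, built from a genomic kernel.

    The first term matches the proteins in the order given; the second term
    matches them crosswise, so the ordering within a pair does not matter.
    """
    x1, x2 = pair_a
    y1, y2 = pair_b
    return (genomic_kernel(x1, y1) * genomic_kernel(x2, y2)
            + genomic_kernel(x1, y2) * genomic_kernel(x2, y1))
```

Because both matchings are summed, swapping the two proteins within either pair leaves the kernel value unchanged, which is the symmetrization mentioned in the text.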

Sequence kernels
We use two sequence kernels in this work: the spectrum kernel [30] and the motif kernel [31]. The spectrum kernel models a sequence in the space of all k-mers, and its feature space is a vector of counts of the number of times each k-mer appears in the sequence. For the motif kernel we use discrete sequence motifs, representing a sequence in terms of a motif composition vector that counts how many times each discrete sequence motif matches the sequence. To compute the motif kernel we used discrete sequence motifs constructed from the eBlocks database [32]. Yeast ORFs contain occurrences of 17,768 motifs out of a set of 42,718 motifs. For both kernels we used a normalized linear kernel in the space of k-mer/motif counts: $K'(x, y) = \langle x, y \rangle / \sqrt{\langle x, x \rangle \langle y, y \rangle}$.
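A spectrum kernel with a normalized linear kernel on the k-mer counts can be sketched as follows. The function names are hypothetical, and the sketch assumes that normalization means dividing the dot product by the norms of the two count vectors (the usual cosine normalization); the motif kernel would use the same normalization on motif counts instead of k-mer counts.

```python
import math
from collections import Counter


def spectrum(seq, k):
    """Count how many times each k-mer occurs in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def normalized_linear_kernel(u, v):
    """Dot product of two count vectors divided by the product of their norms."""
    # A Counter returns 0 for missing keys, so only k-mers in u contribute.
    dot = sum(count * v[kmer] for kmer, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())
                     * sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def spectrum_kernel(seq_a, seq_b, k=3):
    """Normalized spectrum kernel between two sequences."""
    return normalized_linear_kernel(spectrum(seq_a, k), spectrum(seq_b, k))
```

With this normalization a sequence has similarity 1 with itself, and two sequences that share no k-mers have similarity 0.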

Availability
Data and code related to this work are available at http://noble.gs.washington.edu/proj/sppi. All the classification experiments were performed using the PyML machine learning library, available at http://pyml.sourceforge.net.