In simple terms, the machine learning problem addressed here can be stated as follows: given a set of known membrane-binding proteins, can we identify other membrane-binding proteins by classifying the examples in a larger unlabeled set as positive or negative? The main components of the classification process are feature development, classifiers, validation techniques and performance criteria, each of which is discussed below after PU-learning is explained in detail.
Theory behind PU-learning
PU-learning attempts to build a classifier using a two-step strategy [2, 3] (Figure 2):
Step 1: Identifying a set of reliable negative examples from the unlabeled set.
Step 2: Building a set of classifiers by iteratively applying a classification algorithm and then selecting a good classifier from the set.
These two steps together can be seen as an iterative method of increasing the number of unlabeled examples classified as negative while maintaining the number of correctly classified positive examples. Several techniques have been proposed for each step. For the first step, the Rocchio technique, the spy technique, or the 1-DNF technique can be used. For the second step, any classifier, such as an SVM, a Bayes classifier, a random forest or a decision tree, can be used. Because this article focuses on the spy technique, only that technique is explained in detail; readers are directed to the relevant literature [2, 3] for details of the other first-step techniques.
In the spy technique, "spy" examples from the positive set (called the P set) are sent to the mixed or unlabeled set (called the U set) (Figure 3). The approach randomly selects s% of the examples from the P set (in our experiments, 15%). These examples form the 'spies' set, denoted S, which is added to the U set. Because the spies behave identically to the unknown positive examples in U, they allow us to reliably infer the behavior of those unknown positives. The algorithm is as follows:
1. RN = ∅
2. S = sample(P, s%)
3. U = U ∪ S
4. P = P − S
5. Assign every example in P the class c1
6. Assign every example in U the class c2
7. Run any classifier
8. Classify each example in U
9. Determine the probability threshold t using S
10. for each example dj in U
11.     if its probability Pr[c1 | dj] < t
12.         RN = RN ∪ {dj}
13.         U = U − {dj}
14. Repeat steps 7 to 13 with RN and U until RN does not change
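As an illustration, here is a minimal Python sketch of a single pass through steps 1 to 13 (step 14's repetition corresponds to the iterative stage sketched after the next pseudocode block). It assumes the feature matrices are NumPy arrays and uses scikit-learn's RandomForestClassifier as the probabilistic classifier of step 7, although the algorithm permits any classifier; the function name and the exact threshold rule (the smallest spy probability, so that every spy is classified as positive) are our own choices.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def spy_reliable_negatives(X_p, X_u, spy_frac=0.15, seed=0):
    """Extract a reliable negative set (RN) from U using spies from P."""
    rng = np.random.default_rng(seed)
    n_spies = max(1, int(spy_frac * len(X_p)))
    is_spy = np.zeros(len(X_p), dtype=bool)
    is_spy[rng.choice(len(X_p), size=n_spies, replace=False)] = True

    X_s = X_p[is_spy]                 # S = sample(P, s%)
    X_p_rest = X_p[~is_spy]           # P = P - S
    X_mix = np.vstack([X_u, X_s])     # U = U ∪ S

    # Train with P as class c1 (label 1) and U ∪ S as class c2 (label 0).
    X = np.vstack([X_p_rest, X_mix])
    y = np.concatenate([np.ones(len(X_p_rest)), np.zeros(len(X_mix))])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)

    # Threshold t from the spies: the lowest Pr[c1 | spy], so that
    # every spy scores at least t (i.e., all spies classified positive).
    prob_c1 = clf.predict_proba(X_mix)[:, 1]
    t = prob_c1[len(X_u):].min()

    # RN = unlabeled examples with Pr[c1 | dj] < t; the rest form Q.
    reliable = prob_c1[:len(X_u)] < t
    return X_u[reliable], X_u[~reliable]   # (RN, Q)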
The first classifier is built using the P set (after removal of the spies) as the positive set and U ∪ S (the unlabeled set plus the spies) as the negative set. This classifier is then tested on the U ∪ S set, and a threshold is determined such that all the spies are classified as positive. The unlabeled examples that fall below that threshold form the first reliable negative set (RN1), and the remaining examples in U form the Q1 set. The process is then repeated with P (combined with the S set) as the positive set and the reliable negative (RN) set as the negative set, and the resulting classifier is tested on the Q set to extract further reliable negative examples from it. This is repeated until no more examples in the Q set can be classified as negative. The final classifier is then built using the final RN set and the original P set. The pseudocode is as follows:
Every example in P is assigned the class label 1;
Every example in RN is assigned the class label -1;
i = 1;
Loop
    Use P and RN to train a classifier Si;
    Classify Q using Si;
    Let W be the set of examples in Q classified as negative;
    if W = {} then exit-loop
    else Q = Q - W;
         RN = RN ∪ W;
         i = i + 1;
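A minimal Python sketch of this loop is given below, again assuming NumPy feature matrices and, as in our study, a random forest as the inner classifier Si; the function name and signature are our own.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def expand_reliable_negatives(X_p, X_rn, X_q, seed=0):
    """Iteratively grow the reliable negative set RN from Q until it stops changing."""
    clf = None
    while len(X_q) > 0:
        # Train Si on P (label 1) versus the current RN (label -1).
        X = np.vstack([X_p, X_rn])
        y = np.concatenate([np.ones(len(X_p)), -np.ones(len(X_rn))])
        clf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)

        # W = examples in Q that Si classifies as negative.
        negative = clf.predict(X_q) == -1
        if not negative.any():
            break                                 # W is empty: exit-loop
        X_rn = np.vstack([X_rn, X_q[negative]])   # RN = RN ∪ W
        X_q = X_q[~negative]                      # Q = Q - W
    return X_rn, X_q, clf    # final RN, remaining Q, last classifier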
In this article, we also propose and implement a variation of the spy technique in which the spies are added in every iteration rather than only in the first [3]. Figure 4 shows this technique in a cartoon representation.
PU-learning thus adds an extra layer to standard supervised learning: in this layer, a first set of reliable negative examples is created, and then, using this reliable negative set together with the original positive set, further reliable negative examples are extracted in subsequent steps by applying a classifier iteratively. In this study, the spy technique was used for the first step and random forests were used for the second step. As we will show, the PU-learning protocol with the spy technique can be used effectively to build an identification protocol for predicting the membrane-binding properties of a large number of modular domains with unknown properties.
Dataset
For the creation of the positive dataset, the entire human, mouse and yeast proteomes were downloaded from the Swiss-Prot database [11]. All sequences of peripheral domains were then extracted by using their names as keywords, resulting in 932 cases. The known domains include C1, C2, PH, PX, FYVE, ANTH, BAR, FERM and Tubby domains. Pairwise sequence identity was then reduced to 40% using CD-HIT [12], reducing the number of sequences to 232. For the unlabeled set, all domains other than the positive ones were selected from the three proteomes, giving approximately 32,000 examples; after reducing the sequence identity to 20%, 3,759 unlabeled examples remained. A higher sequence-identity cutoff was used for the positive set because the positive examples are few, and a lower cutoff would leave too few examples to build a reliable classification model.
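As an illustration, the identity-reduction step might be scripted as below. The file names are hypothetical; note that cd-hit's own cutoff range stops at 40% identity, so a 20% cutoff such as the one used for the unlabeled set would rely on the psi-cd-hit.pl wrapper distributed with the CD-HIT package.

import subprocess

# 40% identity cutoff for the positive set (cd-hit supports -c down to 0.4,
# which requires word size -n 2).
subprocess.run(["cd-hit", "-i", "positives.fasta", "-o", "positives_40.fasta",
                "-c", "0.4", "-n", "2"], check=True)

# 20% identity cutoff for the unlabeled set, below cd-hit's native range,
# via the psi-cd-hit.pl wrapper.
subprocess.run(["psi-cd-hit.pl", "-i", "unlabeled.fasta", "-o", "unlabeled_20.fasta",
                "-c", "0.2"], check=True)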
Features
During feature development, a protein sequence is reduced to a fixed set of features encoding the characteristics of the protein. It is advisable to choose features that are pertinent to the function of the protein and that show large variation between the positive and negative sets.
All intracellular membranes contain varying degrees of anionic lipids, with the inner plasma membrane being the most anionic [13, 14]. Electrostatic complementarity between cationic proteins and anionic membranes should therefore be an important factor in the membrane binding of peripheral proteins. On the basis of previous studies of membrane-binding proteins, various sequence-based features were selected: the overall charge of the protein; the sums of hydrophobicity, helix propensity and sheet propensity; and the overall sequence composition of the domain (% of each kind of amino acid). In addition, a new family of features called local environment amino acid composition is also used. This representation characterizes each residue by both its identity and its local environment, which is assigned to one of four kinds: low helix and high sheet propensity, high helix and low sheet propensity, and so on. This yields 80 (4 × 20) distinct counts in the new feature vector.
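A minimal Python sketch of this feature family is given below. The text above does not specify the propensity scale, the window over which the local environment is measured, or the low/high cutoff, so the sketch assumes the standard Chou-Fasman helix and sheet propensities, a 5-residue window, and a threshold of 1.0; these are illustrative choices only.

from collections import Counter

# Standard Chou-Fasman helix (P_alpha) and sheet (P_beta) propensities.
HELIX = {"A": 1.42, "R": 0.98, "N": 0.67, "D": 1.01, "C": 0.70, "Q": 1.11,
         "E": 1.51, "G": 0.57, "H": 1.00, "I": 1.08, "L": 1.21, "K": 1.16,
         "M": 1.45, "F": 1.13, "P": 0.57, "S": 0.77, "T": 0.83, "W": 1.08,
         "Y": 0.69, "V": 1.06}
SHEET = {"A": 0.83, "R": 0.93, "N": 0.89, "D": 0.54, "C": 1.19, "Q": 1.10,
         "E": 0.37, "G": 0.75, "H": 0.87, "I": 1.60, "L": 1.30, "K": 0.74,
         "M": 1.05, "F": 1.38, "P": 0.55, "S": 0.75, "T": 1.19, "W": 1.37,
         "Y": 1.47, "V": 1.70}
AMINO_ACIDS = sorted(HELIX)

def local_env_composition(seq, window=5, cutoff=1.0):
    """80-dimensional (4 environments x 20 residues) count vector."""
    counts = Counter()
    for i, aa in enumerate(seq):
        if aa not in HELIX:
            continue                      # skip non-standard residues
        lo, hi = max(0, i - window // 2), i + window // 2 + 1
        neigh = [r for r in seq[lo:hi] if r in HELIX]
        # Environment = (helix propensity high?, sheet propensity high?),
        # averaged over the window: one of the four kinds.
        env = (sum(HELIX[r] for r in neigh) / len(neigh) >= cutoff,
               sum(SHEET[r] for r in neigh) / len(neigh) >= cutoff)
        counts[(env, aa)] += 1
    envs = [(h, s) for h in (False, True) for s in (False, True)]
    return [counts[(e, aa)] for e in envs for aa in AMINO_ACIDS]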
Performance criteria and evaluation technique
The performance of the classifiers is measured using several metrics; the commonly used threshold metrics include accuracy and sensitivity. Accuracy is the ratio of correct predictions to the total number of predictions: Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
Sensitivity, also known as recall or the true positive rate (TPR), is defined as the probability that an example is predicted positive given that it is positive. It is approximated by the fraction of true positives that are predicted positive: Sensitivity = TP / (TP + FN).
For evaluating the performance of the protocol, the holdout technique was used for testing: 40 positive examples were left out to test the final classification protocol and were not used for training at any stage of model building. Because of the uncertainty about their classes, no examples from the unlabeled set were held out for testing, and so only sensitivity was used for performance evaluation. During training, 5-fold cross-validation was used to optimize the parameters.
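A sketch of this evaluation in Python, under the same assumptions as the earlier sketches (NumPy feature matrices; the helper name is our own):

import numpy as np

def holdout_sensitivity(final_clf, X_pos_holdout):
    """Sensitivity = TP / (TP + FN) on the held-out positives.

    Since only positive examples are held out, this is simply the
    fraction of them that the final classifier predicts as positive.
    """
    return (final_clf.predict(X_pos_holdout) == 1).mean()

# Hypothetical usage: set aside 40 positives before any training,
# run the full PU-learning protocol on the rest, then score the holdout.
# rng = np.random.default_rng(0)
# idx = rng.permutation(len(X_pos))
# X_holdout, X_train = X_pos[idx[:40]], X_pos[idx[40:]]
# print(holdout_sensitivity(final_clf, X_holdout))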
Classifier
Decision trees, specifically C4.5, were used as the classifier. A decision tree [15] constructs from the training data a tree model in which every internal node represents a decision and every leaf represents a classification. The learning process starts by finding the split on a single attribute that best classifies the training data; the dataset is then recursively split into two parts, repeating these steps on each subset. A number of loss (or impurity) functions can be used to find the best split, that is, the split with the minimum loss (or error). Specifically, the C4.5 decision tree algorithm [16] developed by Quinlan uses a loss function known as information gain, which is motivated by information theory. The decision tree has several advantages. First, it is fast to train and evaluate. Second, the model (or function) learned during training is usually compact and easy to interpret. Finally, a decision tree does not require much data preprocessing, natively handling most attribute types. Note that most machine learning algorithms have tunable parameters; in this work, the results reported for the C4.5 decision tree algorithm use the default parameter values, which have been empirically found to work well on a number of datasets.
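For readers who want to reproduce this setup approximately, the sketch below uses scikit-learn's DecisionTreeClassifier on synthetic data. Note that scikit-learn implements CART rather than C4.5, but criterion="entropy" selects splits by the same information-gain measure; the toy data and labels are purely illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 80))               # stand-in for the 80 local-env features
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # toy labels: P = 1, RN = -1

# Default parameters, as in the text; entropy mirrors C4.5's information gain.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(clf.get_depth(), clf.score(X, y))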