Computational prediction of MoRFs based on protein sequences and minimax probability machine

Background Molecular recognition features (MoRFs) are an important type of disordered segment that can promote specific protein-protein interactions. They are located within longer intrinsically disordered regions (IDRs) and undergo disorder-to-order transitions upon binding to their interaction partners. The functional importance of MoRFs and the limitations of experimental identification make it necessary to predict MoRFs accurately with computational methods. Results In this study, a new sequence-based method, named MoRFMPM, is proposed for predicting MoRFs. MoRFMPM uses a minimax probability machine (MPM) to predict MoRFs based on 16 features and 3 different windows, neither relying on other predictors nor calculating the properties of the regions surrounding MoRFs separately. Compared with ANCHOR, MoRFpred and MoRFCHiBi on the same test sets, MoRFMPM not only obtains a higher AUC, but also a higher TPR at low FPR. Conclusions The features used in MoRFMPM can effectively predict MoRFs, especially after preprocessing. Besides, MoRFMPM uses a linear classification algorithm and does not rely on the results of other predictors, which makes it accessible and repeatable.


Background
Intrinsically disordered proteins (IDPs) are proteins that contain at least one region lacking a unique 3-D structure [1]. Although not folded, IDPs perform a variety of important functions such as molecular recognition, transport, catalysis, signaling regulation, entropic chain activities, and so on [2]. Furthermore, a single protein may contain several disordered regions that possess different functions [3]. The functions of disordered regions usually stem from their ability to bind partner molecules [4]. Disordered regions can provide malleable interfaces that recognize molecules through increased complementarity via induced fit, or offer alternative interactions under variable conditions and more complex cellular responses [5]. These recognition regions may form folded and complementary interfaces, while the neighboring regions, often denoted as fuzzy, can maintain their disordered state [6]. The notion of fuzziness implies that conformational heterogeneity can be maintained upon interactions of IDPs [7]. Disordered regions mainly contain two types of binding motifs: short linear motifs (SLiMs) and MoRFs. SLiMs are enriched in IDRs; they are generally conserved, 3-10 residues long, and thus may not form regular secondary structures [7]. MoRFs are generally located within longer IDRs and are up to 70 residues long [8]. They promote specific protein-protein interactions and undergo disorder-to-order transitions upon binding their partners [4]. According to the structures they adopt in the bound state, MoRFs can be classified into four subtypes: α-MoRFs, β-MoRFs, ι-MoRFs and complex-MoRFs [9]. When bound, the first three types form α-helices, β-strands and irregular secondary structure, respectively, while the last contains multiple secondary structure types [9].
Because of the functional importance of MoRFs and the limitations of experimental identification, several computational methods have been developed in recent years, such as α-MoRF-PredI [10], α-MoRF-PredII [11], ANCHOR [12,13], MoRFpred [14], MSPSSMpred [15] and MoRFCHiBi [16]. α-MoRF-PredII is an improved version of α-MoRF-PredI, and both are limited to predicting α-MoRFs. ANCHOR and MoRFpred are the most widely used comparison methods in recent years. ANCHOR is a web-based method that predicts protein binding regions which are disordered in isolation but can undergo a disorder-to-order transition upon binding, using the energy estimation approach of IUPred [17]. MoRFpred is also a web-based, comprehensive method. It calculates a MoRF propensity score using a linear-kernel support vector machine (SVM) based on nine sets of features: physicochemical properties from the Amino Acid Index [18], Position Specific Scoring Matrices (PSSM), predicted relative solvent accessibility [19], predicted B-factors [20] and the results of five different intrinsic disorder predictors. PSI-BLAST [21] is then used to align the input sequence with the training sequences to obtain an alignment e-value, which is used to adjust the calculated MoRF propensity score. MSPSSMpred uses a radial basis function (RBF) kernel SVM to predict MoRFs based on calculated conservation scores. This method does not use the predicted results of other predictors as input, and its AUC is close to that of MoRFpred. MoRFCHiBi uses two SVM models to predict MoRFs based on the physicochemical properties of amino acids. The first model uses a sigmoid-kernel SVM to predict MoRF propensities, targeting direct similarities between MoRF sequences. The second model focuses on the general contrast between the amino acid compositions of MoRFs, Flanks and the general protein population, using an RBF (Gaussian) kernel SVM. Finally, the results of the two SVM models are joined and the propensity score is computed using Bayes' rule.
MoRFCHiBi is a very good MoRF predictor that does not rely on other predictors.
In this paper, we propose a novel sequence-based method, MoRFMPM, for predicting MoRFs. First, a simulated annealing algorithm is used to select candidate feature sets from the Amino Acid Index (AA Index) [18]. Then, five structural features from our previous study [22] on IDP prediction are added to the candidate sets for further selection; they comprise Shannon entropy and topological entropy calculated directly from protein sequences, as well as three amino acid propensities from the GlobPlot NAR paper [23]. Finally, we select 16 features and 3 different windows to preprocess the protein sequences, and use MPM [24], a linear classification algorithm, to predict MoRFs. The simulation results show that even though MoRFMPM uses just 16 features, 3 different windows and a linear classifier, it obtains higher AUC and TPR than ANCHOR, MoRFpred and MoRFCHiBi.

Datasets
In order to compare our method with ANCHOR, MoRFpred and MoRFCHiBi, we use the datasets collected by Disfani et al. [14], which were also used to train and test MoRFpred and MoRFCHiBi. Disfani et al. collected protein complexes involving protein-peptide interactions from the Protein Data Bank (PDB) [25] as of March 2008 and filtered them on several principles to identify peptide regions of 5 to 25 residues that were presumed to be MoRFs. The resulting 840 protein sequences are divided into a training set (TRAINING) and a test set (TEST). There are 181 helical, 34 strand, 595 coil and 30 complex MoRF regions in the two sets. TRAINING contains 421 sequences consisting of 245,984 residues, 5396 of which are MoRF residues. TEST contains 419 sequences consisting of 258,829 residues, 5153 of which are MoRF residues. Besides, using the same protocol [26,27], they also collected the TESTNEW set from PDB entries deposited between January 1 and March 11, 2012. TESTNEW contains 45 sequences consisting of 37,533 residues, 626 of which are MoRF residues. In addition, we use the EXP53 set collected by Malhis et al. [28] as the third test set. It contains 53 nonredundant sequences possessing MoRFs, collected from four publicly available experimentally validated sets. EXP53 includes 2432 MoRF residues, of which 729 come from short MoRF regions (up to 30 residues) and 1703 from long MoRF regions (longer than 30 residues). For a more intuitive description of the four datasets, Table 1 lists their specific information.

Performance evaluation
We use AUC to evaluate the performance of different candidate feature sets and different windows, and also to compare our method with other methods. AUC is the area under the ROC curve, which provides an overall assessment of the prediction. In order to compare the performance of each method in detail, we also calculate ACC and FPR at different TPRs. ACC describes the fraction of residues that are correctly predicted, FPR is the false positive rate and TPR is the true positive rate. They are defined as:

ACC = (TP + TN) / (N_MoRF + N_non)

TPR = TP / N_MoRF

FPR = (N_non − TN) / N_non

where TP and TN are the numbers of accurately predicted MoRF residues and non-MoRF residues, and N_MoRF and N_non are the total numbers of MoRF residues and non-MoRF residues, respectively.
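As a concrete reference, the three measures can be computed per residue as follows. This is a minimal sketch: `evaluate` is a hypothetical helper, and `threshold` stands in for whatever cut-off is applied to the propensity scores.

```python
import numpy as np

def evaluate(scores, labels, threshold):
    """Per-residue evaluation of MoRF predictions at a given threshold.

    `scores` are predicted propensities, `labels` are 1 for MoRF residues
    and 0 for non-MoRF residues; residues scoring >= threshold are
    predicted as MoRFs.  Returns (ACC, TPR, FPR) as defined in the text.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pred = scores >= threshold
    tp = int(np.sum(pred & (labels == 1)))    # correctly predicted MoRF residues
    tn = int(np.sum(~pred & (labels == 0)))   # correctly predicted non-MoRF residues
    n_morf = int(np.sum(labels == 1))
    n_non = int(np.sum(labels == 0))
    acc = (tp + tn) / (n_morf + n_non)        # fraction of correct residues
    tpr = tp / n_morf                         # true positive rate
    fpr = (n_non - tn) / n_non                # false positive rate
    return acc, tpr, fpr
```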
Selecting the optimal feature set

First, we use a simulated annealing algorithm to select several candidate sets with different numbers of features from the 544 amino acid indices based on the TRAINING set. Then, we use MPM [24,29] to predict MoRFs based on these candidate feature sets, and select the feature set with the best performance. Figure 1 shows the predictive results on TRAINING and TEST for the different candidate feature sets. The blue line represents the AUC values on TRAINING, and the red line the AUC values on TEST. The distance between the AUC values on the two sets reflects the degree of over-fitting for each candidate set: the shorter the distance, the more robust the predictive performance. Because MPM is a linear classification algorithm, over-fitting is not serious for any of these candidate sets. However, when the number of features in the candidate set is 12 or 13, the predictor clearly achieves both more robust performance and a better AUC value on TEST.
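The AUC values compared here can be computed without an explicit ROC sweep via the rank-based (Wilcoxon-Mann-Whitney) formulation. A small sketch, which for brevity does not average ranks over tied scores:

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the rank statistic.

    Equivalent to the probability that a randomly chosen MoRF residue
    scores higher than a randomly chosen non-MoRF residue; score ties
    are not rank-averaged in this sketch.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # 1-based ascending ranks
    n_pos = int(np.sum(labels == 1))
    n_neg = len(labels) - n_pos
    pos_rank_sum = ranks[labels == 1].sum()
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```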
When the number of features in the candidate set is 12 or 13, the predictive performance is comparable. To further compare them, their ROC curves are shown in Fig. 2. The left panel shows the full ROC curves, which almost overlap. Since we are more concerned with the predictive performance at low FPR, the right panel shows the ROC curves at FPR < 0.1. In this region, the 13-feature set clearly performs better. Thus, we select the candidate set with 13 features as the final candidate feature set from AA Index; it is listed with the AA Index accession numbers in Table 2.
After that, we add the five structural properties selected in our previous study [22] on IDP prediction to the candidate feature set. Then, we vary the number of structural properties in the candidate feature set and use MPM to predict MoRFs. Since there are only five structural features in total, we enumerate all combinations of structural properties for each candidate feature set with a given number of structural properties. Figure 3 shows the best AUC values for different numbers of structural properties. When the number is between 2 and 4, the performance is similar and clearly better than in the other cases. To further compare them, the ROC curves are shown in Fig. 4. Though their full ROC curves almost overlap, as shown in the left panel, 3 and 4 structural properties obtain better performance at FPR < 0.1, as shown in the right panel. Considering that the AUC value with 3 structural properties is slightly higher than that with 4, we finally select 3 structural properties: topological entropy and two of the amino acid propensities from the GlobPlot NAR paper [23].

Selecting the appropriate window sizes
We select three windows to preprocess the protein sequences. Based on each window, we calculate the 16 selected features, so each residue obtains a 48-dimensional feature vector. Then, we vary the sizes of the three windows and use MPM to predict MoRFs. The appropriate sizes of the three windows are set by comparing the predictive performance on TRAINING and TEST. Figure 5 shows the predictive performance for different window sizes. The middle window is always set to half the size of the long window. In the left panel, we fix the sizes of the long and middle windows at 90 and 45 and change the size of the short window from 5 to 11. When the short window is set to 10, the AUC on the TEST set is best. Then, we fix the short window at 10 and change the sizes of the long and middle windows, as shown in the right panel of Fig. 5. The long window size is varied from 50 to 110, and the middle window size follows the long window. At first, as the long window size increases, the AUC on both data sets increases and the distance between them decreases. But when the size is larger than 80, the AUC of the two data sets grows slowly and the distance between them increases. Moreover, when the size is larger than 90, the AUC on TEST tends to be stable. Figure 6 shows the ROC curves on the TEST set for long window sizes between 90 and 110. In the left panel, the ROC curves of the three sizes almost overlap. However, the ROC curve for 90 is better at low FPR, as shown in the right panel. Considering that the proportion of MoRF residues is only about 2% in the TRAINING and TEST sets, we pay more attention to the predictive performance at low FPR. Thus, the long and middle windows are finally set to 90 and 45.
Considering that researchers may require different precision depending on the application, we do not set a standard threshold value. However, if a binary categorical prediction is needed, Table 3 provides three threshold values and their predictive results for reference, according to the FPRs on the TRAINING set. The threshold value can be selected in (−0.5, 0.5); the larger the value, the larger the FPR.
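Deriving a threshold from a target FPR on a reference set, as Table 3 does, can be sketched as follows. The midpoint rule between adjacent negative scores and the `threshold_for_fpr` helper name are our assumptions, not details from the paper.

```python
import numpy as np

def threshold_for_fpr(scores, labels, target_fpr):
    """Pick a threshold whose FPR on a reference set (e.g. TRAINING) is at
    most `target_fpr`; assumes distinct negative scores for simplicity."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    neg = np.sort(scores[labels == 0])[::-1]   # negative scores, descending
    k = int(target_fpr * len(neg))             # false positives allowed
    if k == 0:
        return neg[0] + 1e-9                   # above every negative score
    if k >= len(neg):
        return neg[-1] - 1e-9                  # below every negative score
    return (neg[k - 1] + neg[k]) / 2           # exactly k negatives score above
```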

Comparing with other prediction methods
In this part, we compare our method MoRFMPM with ANCHOR, MoRFpred and MoRFCHiBi on three test sets: TEST, TESTNEW and EXP53. The results of the other methods on these three sets are adopted from [16,28]. Table 4 shows the AUC values of the four methods on the TEST and TESTNEW sets; MoRFMPM obtains the highest AUC on both sets. On the TEST set, we also compare ACC and FPR at different TPRs with the other methods, as shown in Table 5. MoRFMPM achieves lower FPRs and higher ACCs at all three TPRs compared with ANCHOR, MoRFpred and MoRFCHiBi. In other words, MoRFMPM can obtain a higher TPR at low FPR.
In addition, Table 6 shows the AUC results of the four methods on the EXP53 set. In the EXP53_short set, only MoRF regions of up to 30 residues are considered, while longer MoRF regions are masked out. In the EXP53_long set, only MoRF regions longer than 30 residues are considered, while shorter MoRF regions are masked out [28]. From Table 6, MoRFMPM also obtains the highest AUC on the EXP53_all, EXP53_short and EXP53_long sets.

Discussion
We propose a new method, MoRFMPM, to predict MoRFs within protein sequences. It uses MPM to train the predictor based on 16 features and 3 different windows; the features comprise 13 properties selected from AA Index and 3 structural properties from our previous study [22] on IDP prediction, including topological entropy and two amino acid propensities from the GlobPlot NAR paper [23]. We compare MoRFMPM with ANCHOR, MoRFpred and MoRFCHiBi on three different test sets: TEST, TESTNEW and EXP53. The results show that MoRFMPM obtains better performance on these test sets.
To further illustrate the predictive performance of MoRFMPM, the protein p53 is predicted as an example, as shown in Fig. 7. p53 is a master protein in tumor regulation and one of the most extensively studied IDPs [30,31]. The N-terminal and C-terminal regions of this protein are confirmed to contain MoRFs [32-34], which are enclosed by the red lines in Fig. 7. The blue line shows the predictive result of MoRFMPM for each residue. As Fig. 7 shows, MoRFMPM can effectively identify the MoRFs of p53.
The following points enable MoRFMPM to achieve such good performance. First, the appropriate preprocessing highlights the relationship between a residue and its surrounding residues. Second, the feature set used in MoRFMPM is highly effective for predicting MoRFs, especially after preprocessing. Third, instead of considering the properties of Flanks of fixed length, MoRFMPM uses a long window of 90 to describe the influence of adjacent regions on MoRFs and a short window of 10 to highlight the properties of MoRFs. Though the long window may contain much non-MoRF information when calculating the feature vectors of MoRF residues, MoRFMPM uses a middle window of 45 to reduce the noise introduced by the long window. Finally, although MPM is a linear classification algorithm, it is efficient and robust, especially when not too many features are used.

Preprocessing
To highlight the interrelation between residues, the protein sequences are preprocessed. For a protein sequence w of length L, we select a window of length N (N < L) and pad N0 = ⌊(N − 1)/2⌋ zeros at the beginning and end of the sequence, so that the padded length becomes L0 = L + 2N0. Then we slide the window with a step of 1 to intercept successive regions of length N. The i-th intercepted region can be denoted as

w_i = w0_i w0_{i+1} ⋯ w0_{i+N−1},  1 ≤ i ≤ L0 − N + 1,

where w0 represents the sequence after zero padding. For each w_i, the values corresponding to the selected features are calculated as

v_i = [M_1(w_i), M_2(w_i), ⋯, M_K(w_i)]^T,

where M_k(w_i) denotes the value of the k-th feature calculated on w_i. For an amino acid property, M_k(w_i) is the average value of w_i mapped by the scale of that property. For Shannon entropy or topological entropy, M_k(w_i) is the value calculated on w_i by the respective formula [22]. After that, we assign v_i to each residue in w_i. For each residue, all the v_i assigned to it are summed and divided by their cumulative number, so the feature vector x_j (1 ≤ j ≤ L) of the j-th residue can be expressed as

x_j = (1 / |I_j|) Σ_{i ∈ I_j} v_i,

where I_j is the set of indices of the intercepted regions that cover the j-th residue.
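The scheme above can be sketched as follows for scale-type features. The dictionary scales are placeholders rather than the 16 selected features, and entropy-type features would be computed on the window sequence by their own formulas instead of by averaging a scale.

```python
import numpy as np

def preprocess(seq, scales, N):
    """Sliding-window preprocessing sketch.

    Zero-pad the mapped sequence, average each scale over every length-N
    window, then average, for each residue, the values of all windows
    covering it.  `scales` is a list of dicts mapping amino acids to
    numbers (placeholders for the selected features).
    """
    L = len(seq)
    n0 = (N - 1) // 2                 # zeros padded at each end
    n_win = L + 2 * n0 - N + 1        # number of intercepted regions
    feats = np.zeros((L, len(scales)))
    for k, sc in enumerate(scales):
        vals = np.concatenate([np.zeros(n0),
                               [sc.get(a, 0.0) for a in seq],
                               np.zeros(n0)])
        win_means = np.array([vals[i:i + N].mean() for i in range(n_win)])
        for j in range(L):
            lo = max(0, j + n0 - N + 1)            # windows covering residue j
            feats[j, k] = win_means[lo:j + n0 + 1].mean()
    return feats
```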

Feature selection
As mentioned, our feature set contains two parts: properties from AA Index [18] and structural properties. We first select properties from AA Index using a simulated annealing algorithm, as shown in Fig. 8.
The detailed steps are as follows:

(1) Following the Preprocessing section, the sequences in the TRAINING set are preprocessed based on the 544 amino acid scales from AA Index, so that each residue obtains a 544-dimensional feature vector.

(2) Set the number of selected features N_fea.

(3) Set the initial temperature T = T_max, the lower limit temperature T_min and the annealing rate r.

(4) N_fea features are selected randomly from the 544 scales as the initial state S. The separability between MoRF residues and non-MoRF residues is measured by the criterion J_d calculated on the selected N_fea-dimensional feature vectors:

J_d = tr(S_w + S_b),

where S_w and S_b denote the within-class and between-class scatter matrices,

S_w = Σ_{i=1}^{2} P_i E[(x − m_i)(x − m_i)^T],  S_b = Σ_{i=1}^{2} P_i (m_i − m)(m_i − m)^T.

Here P_i is the prior probability of the i-th class, m_i represents the mean vector of the i-th class and m represents the total mean vector. The larger J_d is, the more separable the two types of samples are.

(5) The state S is perturbed, and the new state is accepted or rejected according to the standard simulated annealing (Metropolis) criterion based on the change in J_d; the temperature is then decreased by the rate r, and this step is repeated until T < T_min.

In this paper, we set T_max = 1, T_min = 0.0001 and r = 0.9995. The parameter N_fea is varied from 10 to 20, giving 11 candidate feature sets. We then train MPM with each of the 11 candidate feature sets and select the one with the best prediction performance.
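A compact sketch of this search, under stated assumptions: the neighbour move (swapping one selected feature for an unselected one) is our choice rather than a detail from the paper, and the default cooling schedule is shortened for illustration (the paper's is T_max = 1, T_min = 0.0001, r = 0.9995).

```python
import numpy as np

def j_d(X, y, subset):
    """Separability criterion J_d = tr(S_w + S_b) on the chosen features."""
    Xs = X[:, subset]
    m = Xs.mean(axis=0)                     # total mean vector
    sw = np.zeros((len(subset), len(subset)))
    sb = np.zeros_like(sw)
    for c in np.unique(y):
        Xc = Xs[y == c]
        p = len(Xc) / len(Xs)               # class prior P_i
        mc = Xc.mean(axis=0)                # class mean vector m_i
        sw += p * np.cov(Xc.T, bias=True)   # within-class scatter
        d = (mc - m)[:, None]
        sb += p * (d @ d.T)                 # between-class scatter
    return np.trace(sw + sb)

def sa_select(X, y, n_fea, t_max=1.0, t_min=0.01, r=0.99, seed=0):
    """Simulated-annealing subset search maximizing J_d (a sketch)."""
    rng = np.random.default_rng(seed)
    n_all = X.shape[1]
    state = list(rng.choice(n_all, size=n_fea, replace=False))
    cur = best = j_d(X, y, state)
    best_state, t = state[:], t_max
    while t > t_min:
        cand = state[:]
        out = [f for f in range(n_all) if f not in state]
        cand[rng.integers(n_fea)] = out[rng.integers(len(out))]
        val = j_d(X, y, cand)
        # Metropolis rule: accept improvements, sometimes accept worse states
        if val > cur or rng.random() < np.exp((val - cur) / t):
            state, cur = cand, val
            if cur > best:
                best, best_state = cur, state[:]
        t *= r
    return sorted(best_state)
```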
In addition, we select structural properties from the five features used in our previous research [22] on IDP prediction, which comprise Shannon entropy, topological entropy and three propensities from the GlobPlot NAR paper [23] (http://globplot.embl.de/html/propensities.html), namely the Deleage/Roux, Remark 465 and B-factor (2STD) propensities. It was shown in [22] that these five features can effectively predict IDPs. In addition, MoRFs are generally located within longer IDRs. Thus, we add these five features to the feature set obtained from AA Index for further selection.
Since MoRFs are generally located within longer IDRs, protein sequences with MoRFs usually contain three types of residues: MoRF residues, residues flanking the MoRFs (Flanks) and general non-MoRF residues. In other words, the Flanks are the other disordered residues on both sides of MoRFs, and the general non-MoRF residues are the ordered residues in the sequence. The properties of the three types of residues differ from each other. Thus MSPSSMpred and MoRFCHiBi calculate the properties of Flanks separately, selecting 5 and 8 residues on each side of the MoRFs as Flanks, respectively. However, the number of Flank residues differs between MoRF regions, and even between the two sides of one MoRF region. Therefore, instead of calculating the properties of Flanks separately, we account for the impact of Flanks by choosing three different windows. The first window is short, to highlight the properties of MoRFs; the second is long, to capture the influence of Flanks; and the third lies between them, to reduce the noise generated by the long window. The short window is selected from 5 to 11. Meanwhile, since MoRFs are generally located within longer IDRs, we set the long window to no less than 50. If the long window is very long, it may contain much non-MoRF information when calculating the feature vectors of MoRF residues. This non-MoRF information reduces the predictive accuracy at the low FPRs we are most concerned about, even when a short window is also used. Therefore, we select a middle window half the length of the long window to improve the performance at low FPR.
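For scale-type features, the three-window construction can be sketched as follows; `multi_window_features` is a hypothetical helper name, and entropy features would again be computed on the window sequence itself rather than by averaging per-residue values.

```python
import numpy as np

def window_profile(vals, N):
    """Average, for each residue, the means of all length-N zero-padded
    windows covering it (same scheme as in the Preprocessing section)."""
    n0 = (N - 1) // 2
    padded = np.concatenate([np.zeros(n0), vals, np.zeros(n0)])
    n_win = len(padded) - N + 1
    wm = np.array([padded[i:i + N].mean() for i in range(n_win)])
    return np.array([wm[max(0, j + n0 - N + 1): j + n0 + 1].mean()
                     for j in range(len(vals))])

def multi_window_features(raw, windows=(10, 45, 90)):
    """Stack one profile per (feature, window) pair: with 16 raw features
    and 3 windows, each residue gets a 48-dimensional vector.  `raw` is
    an (L, n_features) array of per-residue feature values."""
    raw = np.asarray(raw, dtype=float)
    cols = [window_profile(raw[:, k], N)
            for N in windows for k in range(raw.shape[1])]
    return np.column_stack(cols)
```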
To select the optimal features from the 544 amino acid indices, we first use only the short window, with the length set to 10. By preprocessing the TRAINING set, each residue gets a 544 × 1 feature vector. Then, using the simulated annealing algorithm, we select several feature sets with different numbers of features as candidate sets. After that, we add the five structural properties and predict MoRFs with the MPM algorithm, using the short window of 10 and the long window of 50, to select the best feature set. Finally, we change the number of structural properties to further optimize the feature set.

MPM prediction model
MPM is a statistical learning method proposed by Lanckriet et al. [24]. Its main idea is to analyze the upper bound of the classification error rate and make it as small as possible. Given a feature matrix to be classified X = [x_1, x_2, ⋯, x_{N_s}], where N_s denotes the number of samples and x_j (1 ≤ j ≤ N_s) denotes the feature vector of the j-th sample, suppose the samples are divided into two groups X_1, X_2 ⊆ X, with X_1 ~ (μ_1, R_1) and X_2 ~ (μ_2, R_2). MPM builds a classification surface W^T X = b that makes the upper bound of the classification error rate as small as possible. Assume that correct classification satisfies W^T X_1 > b for the first group and W^T X_2 < b for the second group, so the classification error rates are P{W^T X_1 ≤ b} and P{W^T X_2 ≥ b}, respectively. The classification surface constructed by MPM should then satisfy

max_{κ, W, b} κ  s.t.  W^T μ_1 − b ≥ κ √(W^T R_1 W),  b − W^T μ_2 ≥ κ √(W^T R_2 W).

Since κ is only an intermediate variable, through a series of derivations the optimization problem becomes

min_W √(W^T R_1 W) + √(W^T R_2 W)  s.t.  W^T (μ_1 − μ_2) = 1.

The classification surface of MPM is thus finally reduced to solving this problem. It is a second-order cone program, which can be solved by an iterative least squares method or an interior point method; in this paper, we use the iterative least squares method given in reference [29]. Assuming that W* is the calculated optimal value, the optimal κ and b can be calculated by

κ* = 1 / (√(W*^T R_1 W*) + √(W*^T R_2 W*)),  b* = W*^T μ_1 − κ* √(W*^T R_1 W*).

Prediction process

For a protein sequence to be predicted, the specific prediction process is shown in Fig. 9. First, the sequence is preprocessed with the selected feature set and the three different windows. Then, the calculated feature matrix is input into the trained MPM, and the predicted result is obtained.
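A minimal sketch of MPM training under this formulation. The null-space parametrisation of the constraint, the fixed iteration count and the small ridge term are our assumptions for illustration, not details taken from reference [29].

```python
import numpy as np

def train_mpm(X1, X2, n_iter=50, ridge=1e-6):
    """Train a linear MPM: minimize sqrt(W'R1 W) + sqrt(W'R2 W) subject to
    W'(mu1 - mu2) = 1 by iterative least squares, then recover kappa and b."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    n = X1.shape[1]
    R1 = np.cov(X1.T, bias=True) + ridge * np.eye(n)   # class covariances,
    R2 = np.cov(X2.T, bias=True) + ridge * np.eye(n)   # slightly regularized
    d = mu1 - mu2
    w0 = d / (d @ d)                  # particular solution of the constraint
    _, _, vh = np.linalg.svd(d[None, :])
    F = vh[1:].T                      # columns span {u : u'd = 0}
    w = w0.copy()
    for _ in range(n_iter):           # iterative least-squares updates
        beta = np.sqrt(w @ R1 @ w)
        eta = np.sqrt(w @ R2 @ w)
        M = R1 / beta + R2 / eta      # quadratic majorizer of the objective
        u = np.linalg.solve(F.T @ M @ F, -F.T @ M @ w0)
        w = w0 + F @ u
    kappa = 1.0 / (np.sqrt(w @ R1 @ w) + np.sqrt(w @ R2 @ w))
    b = w @ mu1 - kappa * np.sqrt(w @ R1 @ w)
    return w, b, kappa                # predict class 1 when w @ x > b
```

A new sample x is then assigned to the first class when w @ x > b, mirroring the decision rule above.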