MFSPSSMpred: identifying short disorder-to-order binding regions in disordered proteins based on contextual local evolutionary conservation
© Fang et al.; licensee BioMed Central Ltd. 2013
Received: 16 March 2013
Accepted: 1 October 2013
Published: 4 October 2013
Molecular recognition features (MoRFs) are short binding regions located in longer intrinsically disordered protein regions. Although these short regions lack a stable structure in the natural state, they readily undergo disorder-to-order transitions upon binding to their partner molecules. MoRFs play critical roles in the molecular interaction network of a cell, and are associated with many human genetic diseases. Therefore, identification of MoRFs is an important step in understanding functional aspects of these proteins and in finding applications in drug design.
Here, we propose a novel method for identifying MoRFs, named MFSPSSMpred (Masked, Filtered and Smoothed Position-Specific Scoring Matrix-based Predictor). First, a masking method is used to calculate the average local conservation scores of residues within a masking-window length in the position-specific scoring matrix (PSSM). Then, the scores below the average are filtered out. Finally, a smoothing method is used to incorporate the features of flanking regions for each residue to prepare the feature sets for prediction. Our method employs no predicted results from other classifiers as input, i.e., all features used in this method are extracted from the PSSM of the sequence only. Experimental results show that, compared with other methods tested on the same datasets, our method achieves the best performance: its AUC is 0.004 to 0.079 higher than that of other methods on TEST419, and 0.045 to 0.212 higher on TEST2012. In addition, when tested on an independent membrane protein-related dataset, MFSPSSMpred significantly outperformed the existing predictor MoRFpred.
This study suggests that: 1) amino acid composition and physicochemical properties in the flanking regions of MoRFs are very different from those in the general non-MoRF regions; 2) MoRFs contain both highly conserved residues and highly variable residues and, on the whole, are highly locally conserved; and 3) combining contextual information with local conservation information of residues facilitates the prediction of MoRFs.
Keywords: Molecular recognition features; Intrinsically disordered protein; Position-specific scoring matrix
With the breaking of the conventional protein paradigm of “sequence-structure-function”, the functional importance of intrinsically disordered proteins (IDPs) has become increasingly apparent. Although IDPs have no well-defined tertiary structures in their natural state, they possess essential biological functions. IDPs play critical roles in a variety of physiological processes such as signal transduction, translation regulation, and protein modification. Specifically, in interaction-mediated signaling events, IDPs possess unique advantages: (a) they lack a stable three-dimensional structure and their conformations can fluctuate over time; (b) they combine high specificity with low affinity; (c) they can recognize multiple partners by adopting different conformations; (d) multiple distinct partners can bind to a common binding site of an IDP, where these partners may assume different folds. Many proteins from higher organisms have been found to be entirely disordered or to contain partly disordered regions. Because no stable structure exists, conventional structure determination methods are inapplicable to IDPs in isolation. However, when bound to their molecular partners, many IDPs undergo a disorder-to-order transition [1-3]. This characteristic makes it possible to obtain the structures of IDPs by crystallizing them in complexes with their molecular binding partners.
Molecular recognition features (MoRFs) are short binding regions (5-25 residues) located in IDP regions, which easily undergo disorder-to-order transitions upon binding to partner proteins. According to their structures in the bound state, MoRFs can be divided into at least three sub-types: α-MoRFs, β-MoRFs, and ι-MoRFs, which form α-helices, β-strands, and structures without a regular pattern of backbone hydrogen bonds, respectively. MoRFs are enriched in highly connected hub proteins, and their complexity reinforces the functional importance of the disordered regions. They act as molecular switches in the molecular-interaction networks of the cell, and are thought to be implicated in the causes of many diseases. Thus, identification of MoRFs is a key step in understanding the functions of these proteins and in finding applications in drug design.
Experimental methods for identifying MoRFs are expensive and time consuming, which makes computational methods indispensable for guiding experimental analysis. So far, because of the limited number of experimentally validated MoRFs, only four custom-built tools for predicting MoRFs are available: α-MoRF-PredI and α-MoRF-PredII are neural network-based predictors aimed at predicting α-MoRFs; ANCHOR concentrates on the prediction of MoRFs that bind to globular proteins; and MoRFpred is a comprehensive predictor that combines the annotations generated by sequence alignments with the prediction results generated by a support vector machine (SVM). These predictors use a variety of predicted features as their input, including predicted disorder probabilities [4-7], predicted solvent accessibility, predicted secondary structure propensities [4, 5], and predicted B-factors. These predicted features not only tend to produce a high-dimensional feature space, but also greatly increase the complexity of the algorithms. Moreover, the performance of these predictors depends heavily on the classifiers that generate their input features. Thus, a simpler and more efficient method for identifying MoRFs is needed.
A number of studies have analyzed the attributes of MoRFs [1-9]. Norman et al. found that all MoRF types showed several strong physicochemical preferences compared with general disordered regions. Fuxreiter et al. found that the amino acid composition and charge/hydropathy properties of MoRFs exhibited a mixture of the characteristics of folded and disordered structures. Chica et al. found that the flanking regions of MoRFs were relevant to linear motif-mediated interactions at both the structural and sequence levels, and that the prediction of MoRFs can be facilitated by contextual information from the protein sequence.
Because the functional sites of proteins need to maintain a high degree of conservation to execute a given function, evolutionary information included in a position-specific scoring matrix (PSSM) has been considered the most predictive feature for identifying the functional sites of ordered proteins. MoRFs are also found to be more conserved than their surrounding residues [9, 10]. However, disordered proteins usually evolve more rapidly than ordered proteins; therefore, standard PSSMs, which capture the conservation information of proteins, are ineffective when used directly. Fortunately, relative local conservation has been proven to be a good feature for motif discovery, and has been used in a number of studies [10-12]. In addition, Shimizu et al. found that ignoring some redundant features in standard PSSMs can significantly improve the prediction of protein disordered regions.
Taking the above information into account, we developed a novel sequence-based method for identifying MoRFs in IDPs. In this method, a masking method is first used to calculate the average local conservation scores. Then, a filtering method is used to drop the scores below the average. Finally, a smoothing method is used to incorporate features of the flanking regions for each residue in the PSSM. The masking and filtering steps strengthen the highly conserved information and filter out poorly conserved information for each residue, thereby ensuring that only highly locally conserved features are considered in the prediction. Moreover, the smoothing method incorporates contextual information from neighboring residues for any given residue. All the related features are extracted only from the protein sequences themselves. We used a support vector machine (SVM) to build the classifier.
In the present study, one training dataset and three test datasets were adopted. We employed the same training dataset that was used by Fatemeh et al. in their study of MoRFpred. This dataset, called TRAINING421, includes 421 MoRF-containing chains with 5,396 positive samples (MoRF residues) and 240,588 negative samples (non-MoRF residues). All the positive samples and an equal number of randomly selected negative samples were used for training our prediction model. After that, three test datasets were used for testing the developed model.
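The balanced sampling described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `balance_samples`, the fixed random seed, the +1/-1 label convention and the toy feature vectors are all assumptions introduced here.

```python
import random

def balance_samples(positives, negatives, seed=0):
    """Keep every positive sample and draw an equal number of negatives at random."""
    if len(negatives) < len(positives):
        raise ValueError("not enough negative samples to balance the set")
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility only
    sampled_negatives = rng.sample(negatives, len(positives))
    # Label MoRF residues +1 and non-MoRF residues -1 (hypothetical convention).
    samples = [(x, 1) for x in positives] + [(x, -1) for x in sampled_negatives]
    rng.shuffle(samples)
    return samples

# Toy stand-ins for per-residue feature vectors.
positives = [[1.0, 0.5], [0.9, 0.7], [0.8, 0.6]]
negatives = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1], [0.3, 0.0], [0.1, 0.1]]
balanced = balance_samples(positives, negatives)
```

With the figures reported for TRAINING421, the same procedure would yield 5,396 positive and 5,396 negative samples.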
First, the two test datasets used in the study of MoRFpred were adopted.
The first, named TEST419, included 419 MoRF-containing chains that were deposited in PDB from April 2008. They shared up to 30% sequence identity with the training dataset.
The second, named TEST2012, included 45 MoRF-containing chains that were deposited in PDB from January 1 to March 11, 2012 and in UniProtKB from February 22, 2012. They shared up to 30% sequence identity with the training dataset.
The training dataset and the above two test datasets are available on the MoRFpred web server [http://biomine-ws.ece.ualberta.ca/MoRFpred/index.html].
Because the TRAINING421 and TEST419 datasets were reported to contain a large number of immune response-related MoRFs (120 among the 840 MoRFs), we built another independent dataset to test whether the developed predictor was biased toward a particular type of MoRF.
This dataset, named TESTMem64, included 64 non-redundant MoRF-containing membrane proteins (50 transmembrane proteins and 14 peripheral membrane proteins), which were extracted from the study of membrane proteins reported by Ioly et al. In their study, Ioly et al. collected 166 non-redundant MoRF-containing membrane protein sequences. We removed sequences longer than 1000 residues, because such sequences cannot be processed by MoRFpred, and sequences with MoRFs longer than 25 residues, because our study aims to predict short MoRFs (5-25 residues). After this filtering, 64 sequences were retained. Details of the TESTMem64 dataset are shown in Additional file 1: Table S1.
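The two exclusion criteria can be expressed as a simple predicate. The sketch below is illustrative only: the function name, the record layout and the toy entries are hypothetical, and it assumes that a chain may carry more than one MoRF and is dropped if any of them exceeds the length limit.

```python
def keep_sequence(seq_length, morf_lengths, max_seq_len=1000, max_morf_len=25):
    """Return True if a chain passes both length criteria."""
    return seq_length <= max_seq_len and all(m <= max_morf_len for m in morf_lengths)

# Hypothetical records: (identifier, sequence length, lengths of annotated MoRFs).
records = [
    ("chain_a", 350, [12]),
    ("chain_b", 1200, [10]),    # sequence too long
    ("chain_c", 800, [30]),     # MoRF too long
    ("chain_d", 1000, [5, 25]),
]
retained = [name for name, length, morfs in records if keep_sequence(length, morfs)]
```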
Feature analysis of MoRFs, their flanking regions, and general non-MoRF regions
The composition of the 421 protein sequences in the TRAINING421 dataset was analyzed. The sequences were divided into three types of region: MoRF regions, regions flanking the MoRFs, and general non-MoRF regions.
The analyses of the composition and physicochemical properties for the three regions illustrated that flanking regions are highly relevant to the MoRFs. Therefore, we assumed that the MoRFs in protein sequences are highly contextual. Because MoRFs are found to be more conserved than surrounding residues [9, 10], we considered incorporating contextual information of residues with local evolutionary conservation to improve the prediction of MoRFs.
Evolutionary information from PSSM
Evolutionary information was obtained from PSSMs, which were generated by PSI-BLAST searches against the NCBI non-redundant (nr) database with three iterations and an e-value of 0.001. The evolutionary information for each residue was encapsulated in a vector of 20 dimensions, one for each of the 20 standard amino acids, so that the PSSM of a protein with N residues is a 20 × N matrix, where N is the length of the protein sequence.
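For readers who wish to generate comparable PSSMs, the following sketch assembles a command line for the BLAST+ `psiblast` program. It is written under stated assumptions: the source does not say which PSI-BLAST release or options were used, so the choice of `psiblast`, the mapping of the reported e-value to `-evalue` (it could equally refer to the inclusion threshold, `-inclusion_ethresh`), and the file names are all illustrative.

```python
def psiblast_command(query_fasta, pssm_out, db="nr", iterations=3, evalue=0.001):
    """Build an argument list for BLAST+ psiblast that writes an ASCII PSSM."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", db,
        "-num_iterations", str(iterations),  # three iterations, as reported
        "-evalue", str(evalue),              # assumption: 0.001 is the report e-value
        "-out_ascii_pssm", pssm_out,
    ]

# Hypothetical file names.
cmd = psiblast_command("protein.fasta", "protein.pssm")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ and a local copy of the nr database are installed.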
Masking and filtering the PSSM
where Masking_Ci represents the relative local conservation score of residue i, Ci is the standard conservation score in the PSSM, and 2n+1 is the masking-window size.
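The masking and filtering steps can be illustrated with a short sketch. It is one possible reading of the description, not the authors' exact formula: it assumes that the local average is taken per PSSM column over the 2n+1 residues centred on residue i, that the window is truncated at the sequence termini, and that scores below the local average are replaced by zero. The function name and the two-column toy matrix are hypothetical.

```python
def mask_and_filter(pssm, n):
    """pssm: one list of scores per residue (20 scores per residue in a real PSSM).

    For each score, compute the average of the same column over the window
    i-n .. i+n (masking), then keep the score only if it is not below that
    average (filtering); otherwise replace it with 0.
    """
    length = len(pssm)
    result = []
    for i, row in enumerate(pssm):
        lo, hi = max(0, i - n), min(length, i + n + 1)  # truncated window (assumption)
        new_row = []
        for k, score in enumerate(row):
            local_mean = sum(pssm[j][k] for j in range(lo, hi)) / (hi - lo)
            new_row.append(score if score >= local_mean else 0)
        result.append(new_row)
    return result

# Toy matrix with three residues and two columns.
toy_pssm = [[1, 4], [5, 0], [3, 2]]
masked = mask_and_filter(toy_pssm, n=1)
```

In this toy case only the scores that reach their local column average survive, which mirrors the stated aim of keeping highly locally conserved information for each residue.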
Smoothing the modified PSSM
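Smoothing folds the features of flanking residues into the feature vector of each residue. The sketch below shows one simple way of doing so and should be read as an assumption rather than the authors' formula: each row is replaced by the sum of the rows in a window of half-width `w` around it, and positions beyond the sequence termini contribute zeros. The function name, the parameter `w` and the toy matrix are hypothetical.

```python
def smooth(pssm, w):
    """Replace each row by the sum of the rows within w positions of it."""
    length = len(pssm)
    width = len(pssm[0]) if pssm else 0
    result = []
    for i in range(length):
        acc = [0.0] * width
        for j in range(i - w, i + w + 1):
            if 0 <= j < length:  # positions beyond the termini add nothing
                for k in range(width):
                    acc[k] += pssm[j][k]
        result.append(acc)
    return result

# Toy matrix with three residues and two columns.
toy_modified = [[0, 4], [5, 0], [0, 2]]
smoothed = smooth(toy_modified, w=1)
```

After smoothing, the vector for a residue reflects its neighbours as well as itself, which is how contextual information enters the feature set.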
Prediction of MoRFs can be treated as a binary classification problem, namely, determining whether a given residue belongs to a MoRF or not. Our prediction model was trained using the LIBSVM software package [19, 20]. The radial basis function (RBF) was selected as the kernel function. The capacity parameter c and kernel width parameter g were then optimized using a grid search approach [19, 20].
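To make the roles of c and g concrete, the sketch below implements the RBF kernel, K(x, y) = exp(-g·||x - y||²), and enumerates a grid of candidate (c, g) pairs. The exponent ranges are assumptions chosen to resemble the coarse power-of-two grids commonly used with LIBSVM; the source does not state which ranges were searched. Each candidate pair would normally be scored by cross-validation, which is not shown here.

```python
import math

def rbf_kernel(x, y, g):
    """RBF kernel value for two feature vectors and kernel width parameter g."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-g * sq_dist)

def parameter_grid(log2c=range(-5, 16, 2), log2g=range(3, -16, -2)):
    """Candidate (c, g) pairs on a power-of-two grid (ranges are assumptions)."""
    return [(2.0 ** c, 2.0 ** g) for c in log2c for g in log2g]

grid = parameter_grid()
```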
where TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives and false negatives, respectively.
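For reference, the standard definitions of accuracy (ACC), true positive rate (TPR) and false positive rate (FPR) in terms of these four counts are sketched below. These are textbook formulas; the set of measures actually reported by the authors beyond ACC and AUC is not confirmed by the surviving text, and the counts in the example are invented. The sketch assumes that all denominators are non-zero.

```python
def evaluate(tp, tn, fp, fn):
    """Standard per-residue measures computed from confusion-matrix counts."""
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),  # fraction of residues classified correctly
        "TPR": tp / (tp + fn),                   # fraction of MoRF residues recovered
        "FPR": fp / (fp + tn),                   # fraction of non-MoRF residues misclassified
    }

# Invented counts for illustration.
scores = evaluate(tp=30, tn=45, fp=15, fn=10)
```

Plotting TPR against FPR over a range of decision thresholds gives the ROC curve, and the area under that curve is the AUC.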
Results and discussion
Optimizing window size
Effectiveness of the feature extracting methods
Effectiveness at the individual protein level
To confirm the effectiveness of our feature extraction method in distinguishing MoRF residues from non-MoRF residues and to determine how this might benefit the prediction, we selected the yeast elongin C complex with a von Hippel-Lindau peptide [PDB: 1HV2] as an example. Yeast elongin C is a signaling protein that is highly intrinsically disordered and has a MoRF located at residues 100-212 of its B chain.
Effectiveness at the whole dataset level
where '20' is the number of standard amino acid residues, and also the number of columns in the PSSMs (a PSSM is a 20 × N matrix), and N is the sequence length.
Performance tested on TEST419, TEST2012 and comparison with other feature-based methods
By default, the PSSMs output directly by PSI-BLAST provide conservation information, and they have been widely used to predict various protein functional sites. However, there is room for improvement, because standard PSSMs contain redundant features.
Performance comparison with four other PSSM-based methods
Performance comparison with existing predictors
Performance comparisons tested on the TEST419 and TEST2012 datasets
Performance on unbalanced training samples
Performance tested on TESTMem64 and comparison with MoRFpred
Performance comparison between MFSPSSMpred and MoRFpred tested on TESTMem64
We speculate that the reasons for the better performance of MFSPSSMpred include the following: (1) MoRFpred incorporates many predicted results as input, such as predicted disorder probabilities, predicted B-factors and predicted relative solvent accessibility derived from other predictors. These predicted features are themselves strongly affected by the classifiers that produced them. Moreover, incorporating many predicted features can easily result in a high-dimensional feature space. (2) MoRFpred merges the result generated by an SVM and the result generated by sequence alignment against the MoRFs database into its final prediction. Since there are so many immune response-related MoRFs in that database, MoRFpred is inevitably biased towards this type of MoRF. (3) MFSPSSMpred uses only the PSSM as input for prediction. It exploits the observation that MoRF regions contain a mixture of highly conserved and highly variable residues. Therefore, our approach is independent of the type or binding partners of MoRFs, and can be applied to the prediction of all MoRFs.
In this study, we propose a novel method that adopts a modified PSSM encoding scheme for MoRF prediction. Our method employs no predicted results as input, and all input features are extracted only from the PSSMs of sequences. By means of masking, filtering and smoothing, the modified PSSMs combine predictive features that can effectively distinguish MoRF from non-MoRF residues. When compared with other existing methods on the same datasets, MFSPSSMpred outperformed them all, achieving 3.3% to 9.3% higher ACC and 0.004 to 0.079 higher AUC than other methods when tested on TEST419, and 10.3% to 17.2% higher ACC and 0.045 to 0.212 higher AUC than other methods when tested on TEST2012. Moreover, despite the training dataset being biased toward immune response-related proteins, when MFSPSSMpred was tested on an independent membrane protein-related dataset (TESTMem64), it showed good adaptability and significantly outperformed the existing MoRFpred predictor.
In summary, this study shows that combining contextual information with local conservation information of residues is predictive for identifying MoRFs. In addition, our study revealed some hallmarks of MoRFs: they contain a mixture of highly conserved and highly variable residues and, on the whole, are highly locally conserved and flanked by less conserved residues. A free Web server has been developed, which allows users to identify MoRFs in a given sequence using the model trained on our dataset. It is available at http://webapp.yama.info.waseda.ac.jp/fang/MoRFs.php.
MoRFs: Molecular recognition features
PSSM: Position-specific scoring matrix
SVM: Support vector machine
IDPs: Intrinsically disordered proteins
ROC: Receiver operating characteristic
AUC: Area under the corresponding ROC curve
TPR: True positive rate
FNR: False negative rate.
We would like to thank Prof. Toh of the Computational Biological Research Center (CBRC) for his generous support of our study, and Fatemeh’s laboratory for providing the prediction results of the MoRFpred predictor. We also thank an anonymous reviewer for his/her helpful comments, which improved the manuscript.
- Mohan A, Oldfield CJ, Radivojac P, Vacic V, Cortese MS, Dunker AK, Uversky VN: Analysis of molecular recognition features (MoRFs). J Mol Biol. 2006, 362: 1043-1059. 10.1016/j.jmb.2006.07.087.
- Vacic V, Christopher JO, Amrita M, Predrag R, Marc SC, Vladimir NU, Dunker AK: Characterization of molecular recognition features, MoRFs, and their binding partners. J Proteome Res. 2007, 6 (6): 2351-2366. 10.1021/pr0701411.
- Norman ED, Kim VR, Robert JW: Attributes of short linear motifs. Mol Biosyst. 2012, 8: 268-281. 10.1039/c1mb05231d.
- Fatemeh MD, Wei-Lun H, Marcin JM, Christopher JO, Bin X, Dunker AK, Vladimir NU, Lukasz K: MoRFpred, a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins. Bioinformatics. 2012, 28 (12): i75-i83. 10.1093/bioinformatics/bts209.
- Oldfield CJ, Cheng Y, Cortese MS, Romero P, Uversky VN, Dunker AK: Coupled folding and binding with alpha-helix-forming molecular recognition elements. Biochemistry. 2005, 44: 12454-12470. 10.1021/bi050736e.
- Yugong C, Christopher JO, Jingwei M, Pedro R, Vladimir NU, Dunker AK: Mining α-helix-forming molecular recognition features with cross species sequence alignments. Biochemistry. 2007, 46 (47): 13468-13477. 10.1021/bi7012273.
- Dosztanyi Z, Mészáros SI: ANCHOR: web server for predicting protein binding regions in disordered proteins. Bioinformatics. 2009, 25 (20): 2745-2746. 10.1093/bioinformatics/btp518.
- Fuxreiter M, Peter T, Istvan S: Local structural disorder imparts plasticity on linear motifs. Bioinformatics. 2007, 23 (8): 950-956. 10.1093/bioinformatics/btm035.
- Chica C, Diella F, Gibson TJ: Evidence for the concerted evolution between short linear protein motifs and their flanking regions. PLoS ONE. 2009, 4 (7): e6052. 10.1371/journal.pone.0006052.
- Norman ED, Denis CS, Richard JE: Masking residues using context-specific evolutionary conservation significantly improves short linear motif discovery. Bioinformatics. 2009, 25 (4): 443-450. 10.1093/bioinformatics/btn664.
- Norman ED, Joanne LC, Denis CS, Toby JG, Mark JC, Richard JE: SLiMPrints: conservation-based discovery of functional motif fingerprints in intrinsically disordered protein regions. Nucleic Acids Res. 2012, September, 12: 1-14.
- Niall JH, Denis CS: Profile-based short linear protein motif discovery. BMC Bioinformatics. 2012, 13: 104. 10.1186/1471-2105-13-104.
- Shimizu K, Hirose S, Tamotsu N: POODLE-S: web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix. Bioinformatics. 2007, 23 (17): 2337-2338. 10.1093/bioinformatics/btm330.
- Ioly KL, Georgios NT, Stavros JH: Analysis of molecular recognition features (MoRFs) in membrane proteins. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics. 2013, 1834 (4): 798-807. 10.1016/j.bbapap.2013.01.006.
- Stephen FA, Thomas LM, Alejandro AS, Jinghui Z, Zheng Z: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
- NR. ftp://ftp.ncbi.nih.gov/blast/db/fasta/nr.gz
- Cheng-Wei C, Emily CYS, Jenn-Kang H, Ting-Yi S, Wen-Lian H: Predicting RNA-binding sites of proteins using support vector machines and evolutionary information. BMC Bioinformatics. 2008, 9 (Suppl 12): S6. 10.1186/1471-2105-9-S12-S6.
- Gonzalez RC, Woods RE: Digital Image Processing. The Second Edition. 2002, Prentice Hall.
- Chih-Chung C, Chih-Jen L: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011, 2 (3): 27.
- A library for support vector machines. [http://www.csie.ntu.edu.tw/~cjlin/libsvm/]
- CASP10. [http://predictioncenter.org/casp10/index.cgi]
- R statistical package. [http://www.r-project.org/]
- Avner S, Marco P, Guy Y, Laszlo K, Burkhard R: Improved disorder prediction by combination of orthogonal approaches. PLoS One. 2009, 4 (2): e4433. 10.1371/journal.pone.0004433.
- Zsuzsanna D, Veronika C, Peter T, István S: IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics. 2005, 21 (16): 3433-3434. 10.1093/bioinformatics/bti541.
- Marcin JM, Wojciech S, Ke C, Kanaka DK, Fatemeh MD, Lukasz K: Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources. Bioinformatics. 2010, 26 (18): i489-i496. 10.1093/bioinformatics/btq373.
- Tuo Z, Eshel F, Bin X, Dunker AK, Vladimir NU, Yaoqi Z: SPINE-D: accurate prediction of short and long disordered regions by a single neural-network based method. Journal of Biomolecular Structure and Dynamics. 2012, 4 (29): 799-813.
- Jonathan JW, Liam JM, Kevin B, Bernard FB, David TJ: The DISOPRED server for the prediction of protein disorder. Bioinformatics. 2004, 20 (13): 2138-2139. 10.1093/bioinformatics/bth195.
- McGuffin L: Intrinsic disorder prediction from the analysis of multiple protein fold recognition models. Bioinformatics. 2008, 24: 1798-1804. 10.1093/bioinformatics/btn326.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.