MOTIPS: Automated Motif Analysis for Predicting Targets of Modular Protein Domains
- Hugo YK Lam†1,
- Philip M Kim†2, 8,
- Janine Mok3, 9,
- Raffi Tonikian4, 5,
- Sachdev S Sidhu4, 5,
- Benjamin E Turk6,
- Michael Snyder3, 10 and
- Mark B Gerstein1, 2, 7
© Lam et al; licensee BioMed Central Ltd. 2010
Received: 3 February 2010
Accepted: 11 May 2010
Published: 11 May 2010
Many protein interactions, especially those involved in signaling, involve short linear motifs consisting of 5-10 amino acid residues that interact with modular protein domains such as the SH3 binding domains and the kinase catalytic domains. One straightforward way of identifying these interactions is by scanning for matches to the motif against all the sequences in a target proteome. However, predicting domain targets by motif sequence alone without considering other genomic and structural information has been shown to be lacking in accuracy.
We developed an efficient search algorithm to scan the target proteome for potential domain targets and to increase the accuracy of each hit by integrating a variety of pre-computed features, such as conservation, surface propensity, and disorder. The integration is performed using naïve Bayes and a training set of validated experiments.
By integrating a variety of biologically relevant features to predict domain targets, we demonstrated a notably improved prediction of modular protein domain targets. Combined with emerging high-resolution data of domain specificities, we believe that our approach can assist in the reconstruction of many signaling pathways.
Important protein-protein interactions (e.g., those involved in signal transduction) are often mediated by modular protein domains. These domains often work in a mix-and-match fashion, thereby acting as the building blocks of signaling pathways. Examples include the SH3 and WW domains, which bind proline-rich motifs, and the serine/threonine kinase domain, which specifically phosphorylates the hydroxyl group of serine and threonine. Throughout, we refer to these collectively as "domains". Since these kinds of domains play an important role in the assembly, regulation and signaling activities of the cell [3, 5, 6], accurate prediction of their targets is crucial to understanding many biological pathways [7, 8].
As a result, various techniques have been developed to predict domain targets and to enhance the prediction. Earlier studies used consensus sequences from phage display experiments to predict the targets of peptide-binding domains. In addition, a modern peptide library screening approach, which is commonly used to determine phosphorylation motifs for kinases, has been shown to determine domain specificity with high accuracy. Both approaches identify the specificity of each domain in a position-specific manner, yielding a Position Specific Scoring Matrix (PSSM; also known as a Position Weight Matrix, PWM). Furthermore, many studies have demonstrated ways to improve prediction performance using genomic information. For instance, comparative genomics and secondary structure information have been used to increase the performance of SH3 target prediction [11, 12].
Nevertheless, the prediction of biologically relevant targets of these domains has yet to be addressed in an automated and integrated fashion. To this end, we present an automated process that integrates comparative genomic (i.e., sequence conservation) and structural genomic (i.e., surface propensity and peptide disorder) data with the traditional profile-scanning method to predict domain targets from experimental screening results (e.g., peptide library screening) or their derived PSSMs. The process is fully automated and implemented as an online server. The implementation is open-source and also available for download at http://motips.gersteinlab.org.
Results and Discussion
An Automated Pipeline Process
Our approach first converts the input data into a PSSM and normalizes it. Second, it scans the target proteome with the normalized PSSM and generates a hit list of potential domain targets. Third, it computes the conservation score, solvent accessibility score, and disorder score for each motif hit from the scores pre-computed for each protein residue. Fourth, it uses naïve Bayes, trained on a validated training set, to integrate these genomic features with the motif matching scores and the number of hits per protein and to predict the most likely targets. Lastly, it sorts the motif hits by their likelihood of interacting with the domain and consolidates them into unique protein hits.
Data Conversion and Normalization
$$Z_{ca} = \frac{S_{ca}}{\sum_{a'=1}^{m} S_{ca'}} \times m \qquad (1)$$
where $Z_{ca}$ is the normalized score for amino acid a at position c, which has a signal score $S_{ca}$, and m is the total number of amino acids. Equation (1) thus computes the weight of each amino acid at each position and scales it up by the total number of amino acids. Some domains, such as the serine/threonine kinase domain, have fixed amino acid targets (e.g., serine and threonine) at a certain position in the binding motif. To reflect this known specificity, a score of 0 is automatically assigned at that position to every amino acid that is not expected there. To indicate the slight probability of observing the fixed amino acids at other positions, a pseudo-count of 1 is assigned to each of them at these non-specific positions.
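The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the MOTIPS implementation: the function name `normalize_pssm`, the dictionary-based matrix layout, and the choice to apply the fixed-residue rules after scaling are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Set

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def normalize_pssm(
    signal: List[Dict[str, float]],
    fixed: Optional[Dict[int, Set[str]]] = None,
) -> List[Dict[str, float]]:
    """Scale each position so that its scores sum to m (the alphabet size).

    signal: one dict per motif position, mapping amino acid -> raw signal score.
    fixed:  hypothetical mapping of position index -> residues required there,
            e.g. {5: {"S", "T"}} for a serine/threonine kinase motif.
    """
    m = len(AMINO_ACIDS)
    fixed = fixed or {}
    # Every residue that is fixed at some position of the motif.
    fixed_residues = set().union(*fixed.values())
    normalized = []
    for c, row in enumerate(signal):
        total = sum(row[a] for a in AMINO_ACIDS)
        if total == 0:
            raise ValueError(f"position {c} has no signal")
        z = {a: m * row[a] / total for a in AMINO_ACIDS}
        if c in fixed:
            # Fixed position: residues that are not expected here get a score of 0.
            for a in AMINO_ACIDS:
                if a not in fixed[c]:
                    z[a] = 0.0
        else:
            # Non-specific position: fixed residues receive a pseudo-count of 1.
            # Applying this after scaling is an assumption of this sketch.
            for a in fixed_residues:
                z[a] = 1.0
        normalized.append(z)
    return normalized
```

With this scaling, a position with no preference yields a score of 1 for every amino acid, so values above 1 indicate enrichment and values below 1 indicate depletion.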
where $q_{ia}$ is the substitution probability for amino acid a replaced by i, and $Q_i$ is the substitution probability for a replaced by any amino acid. In addition to the pseudo-count method based on substitution probabilities, we also provide alternative pseudo-count methods based on flat counting (adding 1 to all values) and entropy (adding a pseudo-count proportional to the entropy of each position to its corresponding values).
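The two alternative pseudo-count methods can be illustrated as follows. This is a sketch under stated assumptions: the text says only that the entropy-based pseudo-count is proportional to the positional entropy, so the proportionality constant `scale`, the use of Shannon entropy in bits, and the function names are hypothetical choices made for the example.

```python
import math
from typing import Dict


def add_flat_pseudocount(row: Dict[str, float], count: float = 1.0) -> Dict[str, float]:
    """Flat counting: add the same constant (1 by default) to every value."""
    return {a: v + count for a, v in row.items()}


def shannon_entropy(row: Dict[str, float]) -> float:
    """Shannon entropy (in bits) of one position, treating scores as weights."""
    total = sum(row.values())
    entropy = 0.0
    for v in row.values():
        if v > 0:
            p = v / total
            entropy -= p * math.log2(p)
    return entropy


def add_entropy_pseudocount(row: Dict[str, float], scale: float = 1.0) -> Dict[str, float]:
    """Add a pseudo-count proportional to the entropy of the position.

    `scale` is an assumed proportionality constant, not a value from the paper.
    """
    pseudo = scale * shannon_entropy(row)
    return {a: v + pseudo for a, v in row.items()}
```

Under this formulation a highly specific position (low entropy) receives almost no pseudo-count, while a permissive position (high entropy) is smoothed more strongly.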
Motif Scanning and Scoring
To improve the efficiency of the scanning algorithm, each motif hit is compared immediately to a sorted hit list of fixed size (currently 2,000 hits) and will only be retained if it has a more significant score than the least significant one in the list.
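The bounded hit list can be sketched as below. Several details are assumptions made for illustration: the window score is taken to be the sum of the per-position PSSM scores, a higher score is treated as more significant, and a binary min-heap stands in for the sorted list described in the text (both keep only the best `max_hits` candidates). The function name `scan_proteome` is hypothetical.

```python
import heapq
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[float, str, int]  # (score, protein id, start offset)


def scan_proteome(
    proteins: Iterable[Tuple[str, str]],
    pssm: List[Dict[str, float]],
    max_hits: int = 2000,
) -> List[Hit]:
    """Slide the PSSM over every sequence and keep only the best max_hits hits."""
    width = len(pssm)
    heap: List[Hit] = []  # min-heap: heap[0] is the least significant retained hit
    for protein_id, sequence in proteins:
        for start in range(len(sequence) - width + 1):
            window = sequence[start:start + width]
            # Assumed scoring: sum of position-specific scores, 0 for unknown residues.
            score = sum(pssm[i].get(res, 0.0) for i, res in enumerate(window))
            hit = (score, protein_id, start)
            if len(heap) < max_hits:
                heapq.heappush(heap, hit)
            elif score > heap[0][0]:
                # Better than the current worst hit: replace it in O(log k).
                heapq.heapreplace(heap, hit)
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    toy_pssm = [{"S": 2.0}, {"P": 3.0}]
    toy_proteome = [("P1", "AAASPAA"), ("P2", "SPSPSP")]
    for score, protein_id, start in scan_proteome(toy_proteome, toy_pssm, max_hits=2):
        print(protein_id, start, score)
```

Because a candidate is compared only against the least significant retained hit, most windows are rejected with a single comparison, and memory use stays fixed regardless of proteome size.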
Structural Features and Scoring
The degree of surface propensity of a given sequence is measured by its relative solvent accessibility, which represents the extent of residue solvent exposure. It is predicted by a protein structure prediction program, SABLE, which uses a neural network-based regression algorithm. To measure the disorder of the sequence, DISOPRED, an approach based on neural networks and PSI-BLAST, is used to estimate the probability of the region being disordered [20, 21]. To measure the conservation of the sequence, orthologs of the sequence are identified using INPARANOID. Following the ortholog identification, the sequences in the orthologous groups are aligned with MUSCLE, and a conservation score for each position in the sequence is estimated from its entropy using AL2CO.
For each protein in each proteome being studied, the solvent accessibility, disorder and conservation scores are pre-computed for each residue. As a result, the scores for the motif hits can be calculated quickly at scan time.
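One way to turn pre-computed per-residue scores into a score for a motif hit is sketched below. The text does not state how residue scores are combined over the hit, so taking the mean over the hit window is an assumption, as are the function names. Prefix sums make each lookup constant-time, which is one way the pre-computation could pay off.

```python
from itertools import accumulate
from typing import List, Sequence


def prefix_sums(residue_scores: Sequence[float]) -> List[float]:
    """Cumulative sums with a leading 0, computed once per protein and feature."""
    return [0.0] + list(accumulate(residue_scores))


def window_mean(prefix: Sequence[float], start: int, width: int) -> float:
    """Mean per-residue score over the hit [start, start + width) in O(1)."""
    return (prefix[start + width] - prefix[start]) / width


if __name__ == "__main__":
    # Hypothetical per-residue disorder probabilities for a short protein.
    disorder = [0.1, 0.2, 0.3, 0.4, 0.9, 0.8]
    prefix = prefix_sums(disorder)
    print(window_mean(prefix, start=1, width=2))
```

The same two functions can be reused for the conservation and solvent accessibility tracks, giving one value per feature for each motif hit.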
Feature Integration and Target Prediction
$$P(I \mid F_1, \ldots, F_n) \propto P(I) \prod_{i=1}^{n} P(F_i \mid I)$$
where I is the class variable (i.e., interaction or non-interaction), $F_i$ is a feature such as the motif scanning score, and n is the total number of features. To assess the independence of the features, pair-wise correlation coefficients were calculated. They averaged 0.23 for the SH3 model and 0.18 for the S/T kinase model, indicating that the features are to a large extent independent. Furthermore, since the independence assumption is not harmful for data pre-processed with Principal Component Analysis (PCA), we performed PCA to transform the possibly correlated features into uncorrelated features. The first three principal components were used to build a naïve Bayes model, which was evaluated by stratified 10-fold cross-validation. The Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve obtained with the PCA transformation (89.1% for the SH3 model and 75.9% for the S/T kinase model) was then compared with the AUC obtained without PCA (91.8% for the SH3 model and 78.6% for the S/T kinase model). No significant deviation in performance was observed between the predictions with and without PCA, indicating no strong dependency among the original features.
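The classification step can be illustrated with a small, dependency-free naïve Bayes classifier. MOTIPS itself uses Weka for this step; the sketch below is not that implementation. It assumes Gaussian class-conditional densities for numeric features, and the class name, feature values, and labels are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


class GaussianNaiveBayes:
    """Naive Bayes with one independent Gaussian per feature and class."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[Hashable]) -> "GaussianNaiveBayes":
        by_class: Dict[Hashable, List[Sequence[float]]] = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        self.priors = {}
        self.stats = {}
        for label, rows in by_class.items():
            self.priors[label] = len(rows) / len(X)
            stats = []
            for column in zip(*rows):
                mean = sum(column) / len(column)
                # A small constant keeps the variance strictly positive.
                var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
                stats.append((mean, var))
            self.stats[label] = stats
        return self

    def predict_proba(self, row: Sequence[float]) -> Dict[Hashable, float]:
        # Work in log space: log P(I) + sum_i log P(F_i | I).
        log_post = {}
        for label, prior in self.priors.items():
            total = math.log(prior)
            for value, (mean, var) in zip(row, self.stats[label]):
                total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
            log_post[label] = total
        # Normalize with the log-sum-exp trick to avoid underflow.
        top = max(log_post.values())
        weights = {label: math.exp(v - top) for label, v in log_post.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical features per hit: [motif score, conservation score].
    X = [[5.0, 0.9], [6.0, 0.8], [5.5, 0.85], [1.0, 0.1], [1.5, 0.2], [0.8, 0.15]]
    y = [True, True, True, False, False, False]
    model = GaussianNaiveBayes().fit(X, y)
    print(model.predict_proba([5.2, 0.8]))
```

The posterior probability of the interaction class is the quantity by which hits would be ranked; features that are only weakly correlated, as reported above, make the product of per-feature likelihoods a reasonable approximation.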
Finally, the motif hits for the domain of interest are classified under the selected model and sorted by their likelihood of interacting with the domain. Hits for the same protein are consolidated into a single hit represented by the most likely target. Genomic information that is not used in the prediction, such as protein-protein interaction data, localization data and phosphorylome data, can also be integrated easily with the tab-delimited hit list for further analysis. Phosphorylation prediction data from mass spectrometry experiments can be used for cross-validation.
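The consolidation step amounts to keeping, for each protein, the hit with the highest likelihood and then sorting the survivors. A minimal sketch follows; the tuple layout and the function name `consolidate_hits` are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

ScoredHit = Tuple[str, int, float]  # (protein id, start offset, likelihood)


def consolidate_hits(hits: Iterable[ScoredHit]) -> List[ScoredHit]:
    """Keep the most likely hit per protein, then sort by decreasing likelihood."""
    best: Dict[str, ScoredHit] = {}
    for hit in hits:
        protein_id, _start, likelihood = hit
        if protein_id not in best or likelihood > best[protein_id][2]:
            best[protein_id] = hit
    return sorted(best.values(), key=lambda h: h[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical yeast ORF identifiers and likelihoods.
    scored = [("YBL007C", 12, 0.71), ("YBL007C", 340, 0.93), ("YHR016C", 88, 0.80)]
    for protein_id, start, likelihood in consolidate_hits(scored):
        print(protein_id, start, likelihood)
```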
System Implementation and Availability
The motif analysis process described above is implemented as an online server, which allows researchers to upload experimental data representing the motifs of their domains and to predict the targets. Our pipeline supports various input data formats. For specific analysis software, it currently supports the GenePix Result (GPR) format (http://www.moleculardevices.com/pages/software/gn_genepix_file_formats.html#gpr), which is usually used for peptide library screening data, and the BRAIN project's peptide format (http://www.baderlab.com/Software/BRAIN/PeptideFile), which is usually used for phage display experiments. For general purposes, it supports the FASTA format (i.e., a set of peptides of the same length that represent the possible interacting sites) and the Nx20 format (i.e., a tab-delimited format that represents the positional scores of a motif profile, with the first row labeled with the amino acid residues and each subsequent row giving one position). The pipeline currently includes 20 proteomes: 14 yeast proteomes (S. cerevisiae, C. albicans, D. hansenii, C. glabrata, K. lactis, N. crassa, S. bayanus, S. castelli, S. kluyveri, S. kudriavzevii, S. mikatae, S. paradoxus, S. pombe, Y. lipolytica), 2 worm proteomes (C. briggsae, C. elegans), and 4 mammalian proteomes (C. familiaris, P. troglodytes, M. musculus, H. sapiens).
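As an illustration of the Nx20 format described above, the following sketch reads such a file into a list of per-position score tables. It assumes exactly 20 tab-separated columns with no row-label column and skips blank lines; the function name `parse_nx20` is hypothetical and the server's own parser may be stricter or more lenient.

```python
import csv
import io
from typing import Dict, List


def parse_nx20(text: str) -> List[Dict[str, float]]:
    """Parse a tab-delimited Nx20 profile: header of residues, one row per position."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    rows = [r for r in reader if any(cell.strip() for cell in r)]
    if not rows:
        raise ValueError("empty profile")
    header = [cell.strip() for cell in rows[0]]
    if len(header) != 20 or len(set(header)) != 20:
        raise ValueError("header must list 20 distinct amino acid residues")
    pssm = []
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != 20:
            raise ValueError(f"row {line_no} has {len(row)} columns, expected 20")
        pssm.append({aa: float(value) for aa, value in zip(header, row)})
    return pssm


if __name__ == "__main__":
    header = "\t".join("ACDEFGHIKLMNPQRSTVWY")
    position = "\t".join(["1.0"] * 20)
    profile = parse_nx20(header + "\n" + position + "\n")
    print(len(profile), profile[0]["S"])
```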
The feature scores were pre-computed and the default prediction models were built in advance. The default models can be replaced by a user-defined training set (a tab-delimited file with the gene in the first column and a logical value in the second indicating the interaction). Moreover, the analysis process is implemented as an asynchronous, multi-threaded pipeline, so the prediction results can be delivered to users offline via email in addition to being displayed online. The entire system is built in the Java programming language under a Model View Controller architecture in which the analysis process is implemented as a standalone open-source program. The process can therefore be customized by researchers and executed from the command line on multiple platforms. The naïve Bayes classification is performed using Weka, the open-source Java data mining software.
The standalone pipeline and database are available for download at the MOTIPS server at http://motips.gersteinlab.org.
By integrating a variety of biologically relevant features and using a Bayesian learning algorithm to predict domain targets, our approach notably improves domain binding and phosphorylation target predictions compared with profile-matching scans alone. We believe our approach is versatile enough to predict targets of different kinds of domains, and its implementation as a public online server could help researchers predict domain targets more accurately.
We acknowledge support from the NIH and from the AL Williams Professorship funds. We would also like to thank Chong Shou for proofreading the manuscript and Kevin Yip for the discussion on the PCA.
- Zarrinpar A, Park SH, Lim WA: Optimization of specificity in a cellular protein interaction network by negative selection. Nature 2003, 426(6967):676–80. doi:10.1038/nature02178
- Pawson T, Nash P: Assembly of cell regulatory systems through protein interaction domains. Science 2003, 300(5618):445–52. doi:10.1126/science.1083653
- Zarrinpar A, Bhattacharyya RP, Lim WA: The structure and function of proline recognition domains. Sci STKE 2003, 2003(179):re8. doi:10.1126/stke.2003.179.re8
- Hanks SK, Quinn AM, Hunter T: The protein kinase family: conserved features and deduced phylogeny of the catalytic domains. Science 1988, 241(4861):42–52. doi:10.1126/science.3291115
- Zeng G, Cai M: Regulation of the actin cytoskeleton organization in yeast by a novel serine/threonine kinase Prk1p. J Cell Biol 1999, 144(1):71–82. doi:10.1083/jcb.144.1.71
- Pawson T: Protein modules and signalling networks. Nature 1995, 373(6515):573–80. doi:10.1038/373573a0
- Tong AH, Drees B, Nardelli G, Bader GD, Brannetti B, Castagnoli L, Evangelista M, Ferracuti S, Nelson B, Paoluzi S, Quondam M, Zucconi A, Hogue CW, Fields S, Boone C, Cesareni G: A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 2002, 295(5553):321–4. doi:10.1126/science.1064987
- Landgraf C, Panni S, Montecchi-Palazzi L, Castagnoli L, Schneider-Mergener J, Volkmer-Engert R, Cesareni G: Protein interaction networks by proteome peptide scanning. PLoS Biol 2004, 2(1):E14. doi:10.1371/journal.pbio.0020014
- Tonikian R, Zhang Y, Boone C, Sidhu SS: Identifying specificity profiles for peptide recognition modules from phage-displayed peptide libraries. Nat Protoc 2007, 2(6):1368–86. doi:10.1038/nprot.2007.151
- Hutti JE, Jarrell ET, Chang JD, Abbott DW, Storz P, Toker A, Cantley LC, Turk BE: A rapid method for determining protein kinase phosphorylation specificity. Nat Methods 2004, 1(1):27–9. doi:10.1038/nmeth708
- Beltrao P, Serrano L: Comparative genomics and disorder prediction identify biologically relevant SH3 protein interactions. PLoS Comput Biol 2005, 1(3):e26. doi:10.1371/journal.pcbi.0010026
- Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M: A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 2003, 302(5644):449–453. doi:10.1126/science.1087361
- Henikoff JG, Henikoff S: Using substitution probabilities to improve position-specific scoring matrices. Comput Appl Biosci 1996, 12(2):135–43.
- Eddy SR: Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol 2004, 22(8):1035–6. doi:10.1038/nbt0804-1035
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992, 89(22):10915–9. doi:10.1073/pnas.89.22.10915
- McLachlan AD: Repeating sequences and gene duplication in proteins. J Mol Biol 1972, 64(2):417–37. doi:10.1016/0022-2836(72)90508-6
- Yaffe MB, Leparc GG, Lai J, Obata T, Volinia S, Cantley LC: A motif-based profile scanning approach for genome-wide prediction of signaling pathways. Nat Biotechnol 2001, 19(4):348–53. doi:10.1038/86737
- Obenauer JC, Cantley LC, Yaffe MB: Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res 2003, 31(13):3635–41. doi:10.1093/nar/gkg584
- Adamczak R, Porollo A, Meller J: Accurate prediction of solvent accessibility using neural networks-based regression. Proteins 2004, 56(4):753–67. doi:10.1002/prot.20176
- Jones DT, Ward JJ: Prediction of disordered regions in proteins from position specific score matrices. Proteins 2003, 53(Suppl 6):573–8. doi:10.1002/prot.10528
- Ward JJ, McGuffin LJ, Bryson K, Buxton BF, Jones DT: The DISOPRED server for the prediction of protein disorder. Bioinformatics 2004, 20(13):2138–9. doi:10.1093/bioinformatics/bth195
- Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 2001, 314(5):1041–52. doi:10.1006/jmbi.2000.5197
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32(5):1792–7. doi:10.1093/nar/gkh340
- Pei J, Grishin NV: AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics 2001, 17(8):700–12. doi:10.1093/bioinformatics/17.8.700
- Tonikian R, et al.: Bayesian modeling of the yeast SH3 domain interactome predicts spatiotemporal dynamics of endocytosis proteins. PLoS Biol 2009, 7(10):e1000218. doi:10.1371/journal.pbio.1000218
- Mok J, et al.: Deciphering protein kinase specificity through large-scale analysis of yeast phosphorylation motifs. Sci Signal 2010, 3(109):ra12. doi:10.1126/scisignal.2000482
- Turhan B, Bener A: Analysis of Naive Bayes' assumptions on software fault data: An empirical study. Data Knowl Eng 2009, 68(2):278–290. doi:10.1016/j.datak.2008.10.005
- Puntervoll P, et al.: ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res 2003, 31:3625–3630. doi:10.1093/nar/gkg545
- Frank E, Hall M, Trigg L, Holmes G, Witten IH: Data mining in bioinformatics using Weka. Bioinformatics 2004, 20(15):2479–81. doi:10.1093/bioinformatics/bth261
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.