Prediction of nuclear proteins using SVM and HMM models
© Kumar and Raghava; licensee BioMed Central Ltd. 2009
Received: 07 August 2008
Accepted: 19 January 2009
Published: 19 January 2009
The nucleus, a highly organized organelle, plays an important role in cellular homeostasis. Nuclear proteins are crucial for chromosomal maintenance/segregation, gene expression, RNA processing/export, and many other processes. Several methods have been developed in the past for predicting nuclear proteins. The aim of the present study is to develop a new method that predicts nuclear proteins with higher accuracy.
All modules were trained and tested on a non-redundant dataset and evaluated using five-fold cross-validation. Firstly, Support Vector Machine (SVM) based modules were developed using amino acid and dipeptide compositions; they achieved a Matthews correlation coefficient (MCC) of 0.59 and 0.61 respectively. Secondly, we developed SVM modules using split amino acid composition (SAAC) and achieved a maximum MCC of 0.66. Thirdly, a hidden Markov model (HMM) based module/profile was developed for searching exclusively nuclear and non-nuclear domains in a protein. Finally, a hybrid module was developed by combining the SVM module and the HMM profile; it achieved an MCC of 0.87 with an accuracy of 94.61%. This method performs better than the existing methods when evaluated on blind/independent datasets. Our method estimated 31.51%, 21.89%, 26.31%, 25.72% and 24.95% of the proteins as nuclear proteins in the Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, mouse and human proteomes respectively. Based on the above modules, we have developed a web server, NpPred, for predicting nuclear proteins: http://www.imtech.res.in/raghava/nppred/.
This study describes a highly accurate method for predicting nuclear proteins. For the first time, an SVM module for predicting nuclear proteins has been developed using SAAC, in which the amino acid compositions of the N-terminus and of the remaining protein were computed separately. In addition, this study is the first to identify exclusively nuclear and non-nuclear domains and use them for predicting nuclear proteins. The performance of the method improved further when both approaches were combined.
The genomes of a large number of organisms have been completely sequenced or are in the final stage of completion, owing to advances in technology. The functional annotation of proteomes is therefore one of the major challenges of the post-genomic era, as the number of proteins with known sequences is growing at an exponential rate. The experimental techniques for assigning function are slow, costly and cumbersome. To assist biologists in the functional annotation of proteomes, a large number of computational methods have been developed. Similarity search is one of the most commonly used techniques for assigning a function to a newly sequenced protein. However, it fails if the query protein has no sequence similarity with a protein whose function is known.
One of the indirect techniques used for assigning the function of a protein is the prediction of its subcellular localization. Because the function of a protein is closely related to its cellular attributes, related proteins must be localized in the same cellular compartment to cooperate toward a common function. In the past, a large number of methods have been developed for predicting the subcellular localization of proteins, and most of them were developed for predicting multiple locations. Although multi-location prediction methods provide comprehensive information, they are not optimized for a particular location. Hence, recent studies have focused on developing methods that predict the proteins of a specific location [1–3].
One of the important compartments of a eukaryotic cell is the nucleus, which is essential for regulating various biological activities. Thus there is a need for an accurate method for predicting nuclear proteins. In the past, several methods have been developed for this purpose. PredictNLS was the first method developed using the Nuclear Localization Signal (NLS). Heddad et al developed NucPred, a genetic programming based method that tried to compile a list of potential NLSs. Recently NucPred has been evaluated on a new dataset, and its performance was found to be better than that of PredictNLS and of the generalized subcellular localization methods LOCtree and BaCelLo. In the present work, a systematic attempt has been made to predict nuclear proteins with high accuracy. All the nuclear and non-nuclear proteins were analyzed in order to understand the major features of nuclear proteins. Based on these observations, SVM based modules were developed for predicting nuclear proteins. In addition, we developed an HMM based module that uses both exclusive nuclear and exclusive non-nuclear domains.
Analysis of amino acid composition
SVM modules of amino acid and dipeptide composition
The performance of SVM models using various types of composition. [Table body not preserved. Surviving row labels: SAAC 2-parts (equal); SAAC 3-parts (equal); SAAC 4-parts (equal).]
Previously, dipeptide composition has been used successfully for predicting the subcellular localization of proteins [10, 11]. Hence in this study we developed an SVM module using dipeptide composition and achieved a maximum MCC of 0.61 (with an overall accuracy of 82.83%) using the polynomial kernel. This showed that the SVM model based on dipeptide composition performed better than the one based on amino acid composition.
Split amino acid composition (SAAC)
Occurrence of Pfam domains
The performance of hybrid module, which combines HMM and SVM model (using NT25+R).
Benchmarking of methods
The performance of different subcellular localization methods on the blind/independent datasets used in BaCelLo. [Table body not preserved. Surviving column groups: Blind1 Dataset (Animal Proteins); Blind2 Dataset (Fungal Proteins).]
The performance of nuclear protein prediction methods on 2213 human proteins (1526 non-nuclear and 687 nuclear). [Table body not preserved. Surviving column header: Probability of correct prediction of nuclear proteins. Surviving row labels: NucPred (0.8 threshold); NucPred (0.5 threshold); NucPred (0.8) AND PredictNLS; NucPred (0.8) OR PredictNLS.]
A web server, NpPred, has been developed for predicting nuclear proteins. It allows users to submit up to 1000 sequences at a time for prediction. NpPred was developed using Perl, CGI-perl and HTML, and is hosted on a SUN T1000 server under the Solaris 10.0 environment. The server is available to academic users at http://www.imtech.res.in/raghava/nppred/ [see Figure S2 in Additional file 1].
The prediction results are displayed in a user-friendly format. The result page first displays the prediction parameters, such as the prediction approach, the SVM threshold and the e-value of the Pfam domain search. For SVM-only prediction, the SVM score is displayed along with the prediction result [see Figure S3 in Additional file 1]. For the hybrid approach based on SVM and Pfam, the result page displays, for each query sequence, the Pfam domains and their nature (exclusive nuclear/non-nuclear), the SVM score and the final prediction [see Figure S4 in Additional file 1].
Within each subcellular compartment of a given cell type, proteins have co-evolved according to the surrounding physico-chemical environment. However, the general features of the nuclear environment have been constant factors throughout eukaryotic evolution. These factors impose environmental constraints on the evolution of protein sequence and structure, so proteins have to adapt to the constraints of their compartment. If this hypothesis is true, then instead of simply searching with a single amino acid sequence, a better approach is to search for stretches of sequence that are known to be conserved in similar types of proteins. Taking this into consideration, we used the Pfam domain database and extracted three types of domains: exclusive nuclear, exclusive non-nuclear and shared. Proteins having an exclusive nuclear or an exclusive non-nuclear domain were predicted as nuclear or non-nuclear respectively. This approach was able to predict only 1858 proteins as nuclear. At first impression, the profile search approach does not seem very efficient. But if we also consider the proteins that were not assigned to any class because they contain shared domains, it becomes clear that this approach can filter out the sequences that BLAST might classify wrongly: if a BLAST search were done on proteins having shared domains, they might be wrongly classified as non-nuclear. In reality, these proteins may either not be present in the nucleus or have only a transient stay there. Hence the actual challenge is to increase the coverage and decrease the number of false predictions.
Among all the cellular organelles of the eukaryotic cell, the nucleus is a particularly interesting one. Unlike other organelles, it is not strictly isolated from the cytoplasm, owing to the presence of nuclear pore complexes. The nuclear pore complex is permeable to small (<5 kDa) neutral molecules. The presence of chromosomes and their regulatory proteins makes the nucleus the central point of gene regulation. Hence the prediction of nuclear proteins can be an important step in understanding their function and in building protein networks. In this study, an attempt has been made to improve the accuracy of nuclear protein prediction. First we analyzed and compared the composition of nuclear and non-nuclear proteins. It was observed that certain amino acids are more prominent in nuclear proteins, whereas a few others are more prominent in non-nuclear proteins. This observation reveals the possibility of discriminating nuclear from non-nuclear proteins on the basis of amino acid composition. Based on these observations, SVM models were developed using amino acid composition, and they achieved a reasonable accuracy. It has been shown in the past that models based on dipeptide composition perform better than models based on amino acid composition, because dipeptides also provide information about local order. As shown in Table 1, the SVM model based on dipeptide composition was better than the model based on amino acid composition. Because nuclear proteins carry an NLS, the first logical step in predicting them is to search for signal peptides in the sequence. But previous studies have shown that the prediction coverage obtained using NLSs is very low. Moreover, such a search can miss proteins that do not have an NLS or that are transported into the nucleus as a complex with another protein. The alternative approach would be to infer location from sequence homology to a protein with a known location. Similarity based annotation is said to be highly accurate if an experimentally annotated homologous protein is present in the database.
In their study, however, Cokol et al observed a different scenario. They found about 30 protein pairs with >80% sequence identity but different subcellular locations. This shows that wrong interpretation is possible even if the search is done on very clean and experimentally annotated data. The chance of wrong annotation by BLAST search increases many fold if a general database such as SWISS-PROT is used. In addition, a BLAST search may return no hit at all, which reduces the portion of the proteome that can be annotated. This shows the limitation of BLAST searching.
In order to increase the coverage, we developed SVM modules based on different forms of amino acid composition. We found the maximum performance with the model based on NT25+R composition. To exploit the benefits of generalized SVM based prediction as well as profile search, we also developed a hybrid prediction module, NpPred.
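The hybrid idea can be sketched as a simple decision rule. This is a minimal illustration and not the NpPred implementation: the function name `hybrid_predict`, the default threshold of 0.0 and the order of precedence (a domain-based call first, the SVM score otherwise) are assumptions made for the example.

```python
def hybrid_predict(domain_call, svm_score, threshold=0.0):
    """Combine a domain-based call with an SVM score (illustrative only).

    domain_call: "nuclear", "non-nuclear", or None when the profile
                 search gave no exclusive-domain evidence.
    svm_score:   decision value of an SVM model for the same protein.
    threshold:   assumed cut-off on the SVM score.
    Returns (label, source), where source says which module decided.
    """
    if domain_call in ("nuclear", "non-nuclear"):
        # Assumption: exclusive-domain evidence takes precedence.
        return domain_call, "domain"
    label = "nuclear" if svm_score >= threshold else "non-nuclear"
    return label, "svm"


if __name__ == "__main__":
    print(hybrid_predict("nuclear", -0.4))  # decided by the domain module
    print(hybrid_predict(None, 0.7))        # decided by the SVM module
```

Raising or lowering `threshold` trades sensitivity against specificity for the proteins that fall through to the SVM module.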
NpPred was also compared with different 'subcellular localization' prediction methods using two independent datasets. The first dataset contains fungal and animal proteins extracted from SWISS-PROT releases 41 to 48. Our training dataset (data_main) contains protein sequences only up to SWISS-PROT release 40.41. This means that the sequences of the independent dataset were not included during the five-fold cross-validation phase. On this dataset the performance of NpPred is superior to that of seven other prediction methods. The second independent dataset contains only human protein sequences, which were earlier used for benchmarking NucPred. On this dataset too, NpPred performed better than the other five methods. Together, these results demonstrate that the method described in this study performs better than the existing subcellular localization methods in predicting nuclear proteins. In summary, this method will complement the existing 'subcellular localization' methods in the prediction of nuclear proteins.
The nucleus is a highly complex organelle that houses the genome and its regulatory factors. Hence the prediction of nuclear proteins can be an important step towards understanding gene regulatory mechanisms and their interactions. We developed NpPred, a highly accurate genome-scale method for predicting nuclear proteins. First, a domain database, NucPfam, was developed that classifies domains on the basis of their occurrence in nuclear and non-nuclear proteins. When a protein contains no domains, the SVM module is used for prediction. Five-fold cross-validation showed an accuracy of 94.61% for NpPred. Furthermore, NpPred was used to predict the nuclear proteins in five representative proteomes. These genome-scale predictions and the NucPfam domain database can provide an excellent starting point for experimentalists to improve the functional annotation of proteins. A web server, NpPred, has also been developed to make the prediction method available to the scientific community. We hope that NpPred will help to speed up protein function prediction. The only limitation we perceive in the present work is that we considered only the steady-state localization of proteins and did not take into account proteins that enter the nucleus in a transient or temporally regulated manner. The algorithm is therefore aimed only at finding proteins that are nuclear at steady state. An ideal method should also address transient localization.
Main dataset (data_main)
The selection of the dataset is the most important consideration in developing a prediction method. The sequences used for training should be of high curation quality and should not include proteins that belong to the same gene family or homologous genes from various organisms. If the proteins in the dataset are highly similar to each other, the method will show very high accuracy during training but will not be very effective in real-life prediction. Hence the final dataset is created in such a way that no two representative proteins have sequence similarity above a certain threshold. This type of dataset is called a non-redundant dataset. In this work we used the non-redundant dataset of 10372 eukaryotic proteins obtained from Guda et al, 2004. This dataset has been used earlier for developing MITOPRED and Mitpred. Originally these sequences were extracted from Swiss-Prot release 40.41 http://www.ebi.ac.uk/swissprot/. It consists of proteins with experimentally determined subcellular locations (cytoplasm = 1712, nucleus = 2710, mitochondria = 1432, extracellular or secretory = 3471, endoplasmic reticulum = 644, plasma membrane = 108, golgi complex = 142 and peroxisome = 153). Sequences with low quality annotation, such as 'by similarity', 'potential', 'probable' and 'possible', were not included in the dataset. In this study, the 2710 nuclear proteins were used as positive examples and the remaining 7662 as negative examples.
Blind or independent dataset
We used three blind datasets obtained from different sources: i) the Blind1 dataset has 363 nuclear and 344 non-nuclear animal proteins, earlier used in BaCelLo for benchmarking different eukaryotic subcellular localization methods; ii) the Blind2 dataset has 122 nuclear and 57 non-nuclear fungal proteins, also used in BaCelLo; and iii) the Blind3 dataset consists of 687 nuclear and 1526 non-nuclear human proteins used for benchmarking NucPred.
Performance evaluation and parameters
Jackknife, or leave-one-out cross-validation, is considered to be the most rigorous test for performance evaluation. But a jackknife test usually takes a very long time to perform. As a compromise, we used the less rigorous 5-fold cross-validation, in which the proteins of each class were randomly divided into five sets [2, 11, 16]. Four sets were used for training and the remaining set for testing. This process was repeated five times so that each set was used once for testing.
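The splitting procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used in the study: the function names, the fixed random seed and the round-robin assignment of shuffled examples to folds are choices made for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Randomly divide the examples of each class into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices out round-robin so every fold
        # receives a near-equal share of this class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def train_test_splits(labels, k=5, seed=0):
    """Yield (train, test) index lists; each fold is the test set once."""
    folds = stratified_folds(labels, k, seed)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds)
                       if j != i for idx in fold)
        yield train, test


if __name__ == "__main__":
    # Toy labels: 1 = nuclear, 0 = non-nuclear.
    toy_labels = [1] * 10 + [0] * 20
    for train, test in train_test_splits(toy_labels):
        print(len(train), len(test))
```

Splitting class by class keeps the ratio of nuclear to non-nuclear proteins roughly the same in every fold, which matters when the classes are imbalanced.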
Where TP and TN are correctly predicted positive (nuclear) and negative (non-nuclear) proteins respectively. FP and FN are wrongly predicted nuclear and non-nuclear proteins respectively.
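As an illustration of how such measures are computed from the four counts, the sketch below uses the standard textbook definitions of sensitivity, specificity, accuracy and MCC. Which of these measures, and in what exact form, were used in the study beyond the accuracy and MCC reported in the text is an assumption here; the function name `confusion_metrics` is hypothetical.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Standard measures from confusion-matrix counts.

    Percentages are on a 0-100 scale; MCC lies in [-1, 1].
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    total = tp + tn + fp + fn
    sensitivity = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    specificity = 100.0 * tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = 100.0 * (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    print(confusion_metrics(tp=50, tn=40, fp=10, fn=0))
```

Unlike accuracy, MCC stays informative when the two classes differ greatly in size, because it uses all four counts.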
Where comp(i) and dpep(j) are the amino acid composition of residue type i and the dipeptide composition of dipeptide type j, respectively. N is the total number of amino acids in the protein.
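A minimal sketch of the two feature vectors follows. Several details are assumptions made for illustration: values are expressed as percentages, dipeptides are counted as overlapping pairs and normalized by the number of pairs in the sequence, and non-standard residue letters contribute to the length but to no feature.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """20-dimensional vector: percentage of each residue type."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(seq)
    return [100.0 * counts[aa] / n for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """400-dimensional vector: percentage of each overlapping dipeptide."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not pairs:
        return [0.0] * len(DIPEPTIDES)
    counts = Counter(pairs)
    return [100.0 * counts[dp] / len(pairs) for dp in DIPEPTIDES]


if __name__ == "__main__":
    example = "MKRKAAK"  # toy sequence, not taken from the dataset
    print(len(amino_acid_composition(example)),
          len(dipeptide_composition(example)))
```

The dipeptide vector is larger (400 values instead of 20) but, unlike the amino acid composition, it retains some information about the local order of residues.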
Split amino acid composition
In the case of split amino acid composition, the protein sequence was divided into non-overlapping fragments and the composition of each fragment was calculated independently. Thus the final input vector has n × 20 dimensions, where n is the number of fragments. In this study, proteins were divided into (i) two parts, (ii) three parts and (iii) four parts.
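The equal-parts variant can be sketched as below. How residues are distributed when the sequence length is not divisible by n is not stated in the text; giving the leftover residues to the leading fragments is an assumption, as are the function names and the percentage scale.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(fragment):
    """Percentage of each of the 20 residue types in a fragment."""
    n = len(fragment)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * fragment.count(aa) / n for aa in AMINO_ACIDS]


def split_equal(seq, n_parts):
    """Cut seq into n_parts contiguous, non-overlapping fragments."""
    size, extra = divmod(len(seq), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # Assumption: the first `extra` fragments get one more residue.
        end = start + size + (1 if i < extra else 0)
        parts.append(seq[start:end])
        start = end
    return parts


def split_amino_acid_composition(seq, n_parts):
    """Concatenate per-fragment compositions into one n_parts x 20 vector."""
    vector = []
    for part in split_equal(seq.upper(), n_parts):
        vector.extend(composition(part))
    return vector


if __name__ == "__main__":
    toy = "MKRKAAKLLDDEEGG"  # toy sequence for illustration
    for n in (2, 3, 4):
        print(n, len(split_amino_acid_composition(toy, n)))
```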
Composition of terminal residues
It has been observed that a protein may carry a localization signal at its N- or C-terminus. In order to exploit this knowledge, we developed models using the composition of the N-terminus or C-terminus together with the composition of the remaining portion of the protein.
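A sketch of this terminal-plus-remainder encoding is given below. It assumes that a label such as NT25+R denotes the composition of the first 25 N-terminal residues followed by the composition of the rest of the protein; the function name, the percentage scale and the handling of sequences shorter than the terminal window are likewise assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(fragment):
    """Percentage of each of the 20 residue types in a fragment."""
    n = len(fragment)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * fragment.count(aa) / n for aa in AMINO_ACIDS]


def terminal_split_composition(seq, n_residues=25, terminus="N"):
    """40-dimensional vector: terminal composition, then the remainder.

    terminus="N" uses the first n_residues, terminus="C" the last.
    If the sequence is shorter than n_residues, the remainder is empty
    and its 20 values are all zero.
    """
    if n_residues <= 0:
        raise ValueError("n_residues must be positive")
    seq = seq.upper()
    if terminus == "N":
        terminal, rest = seq[:n_residues], seq[n_residues:]
    elif terminus == "C":
        terminal, rest = seq[-n_residues:], seq[:-n_residues]
    else:
        raise ValueError("terminus must be 'N' or 'C'")
    return composition(terminal) + composition(rest)


if __name__ == "__main__":
    toy = "K" * 25 + "A" * 75  # toy sequence for illustration
    print(len(terminal_split_composition(toy, 25, "N")))
```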
Support Vector Machine
In this study we implemented SVM using the SVM_light package http://svmlight.joachims.org/, which allows a choice among a number of parameters and kernels (e.g. linear, polynomial, radial basis function, sigmoid or any user-defined kernel). The selection of the kernel is very important in SVM and is analogous to choosing the architecture of an artificial neural network (ANN). In this study we used linear, polynomial and RBF kernels. A detailed description of SVM is available elsewhere.
Occurrence of Pfam domains
In the present study, hidden Markov model (HMM) based searching was implemented using HMMER http://hmmer.janelia.org/, and proteins were searched against the Pfam domain database (version 21.0). Pfam contains multiple sequence alignments and hidden Markov models of protein domains and families. Each protein of data_main was searched against the Pfam database using HMMER at an e-value of 1e-5. The search results were analyzed to detect three types of domains: (i) exclusive nuclear domains (occurring only in nuclear proteins), (ii) exclusive non-nuclear domains (found only in non-nuclear proteins) and (iii) shared domains (present in both types of proteins). A protein was assigned as nuclear or non-nuclear if it contained an exclusive nuclear or an exclusive non-nuclear domain respectively.
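The three-way classification of domains and the resulting assignment rule can be sketched with plain set operations. The sketch assumes that the HMMER hits have already been parsed into a set of domain identifiers per protein; the parsing step is omitted, the identifiers in the example are placeholders rather than real Pfam accessions, and leaving a query undecided when it carries both kinds of exclusive domain is an assumption, since the text does not cover that case.

```python
def classify_domains(protein_domains, is_nuclear):
    """Group domains by the classes of training proteins they occur in.

    protein_domains: dict mapping protein id -> set of domain ids
    is_nuclear:      dict mapping protein id -> True for nuclear proteins
    """
    in_nuclear, in_non_nuclear = set(), set()
    for pid, domains in protein_domains.items():
        if is_nuclear[pid]:
            in_nuclear.update(domains)
        else:
            in_non_nuclear.update(domains)
    return {
        "exclusive_nuclear": in_nuclear - in_non_nuclear,
        "exclusive_non_nuclear": in_non_nuclear - in_nuclear,
        "shared": in_nuclear & in_non_nuclear,
    }


def assign_by_domain(query_domains, classes):
    """Return 'nuclear', 'non-nuclear', or None when undecided."""
    has_nuc = bool(query_domains & classes["exclusive_nuclear"])
    has_non = bool(query_domains & classes["exclusive_non_nuclear"])
    if has_nuc and not has_non:
        return "nuclear"
    if has_non and not has_nuc:
        return "non-nuclear"
    # Only shared domains, no domains, or conflicting evidence.
    return None


if __name__ == "__main__":
    # Placeholder ids, not real Pfam accessions.
    training = {"p1": {"D1", "D2"}, "p2": {"D2", "D3"}, "p3": {"D4"}}
    labels = {"p1": True, "p2": False, "p3": False}
    classes = classify_domains(training, labels)
    print(assign_by_domain({"D1"}, classes))
    print(assign_by_domain({"D2"}, classes))
```

Queries for which `assign_by_domain` returns `None` are the ones that need another predictor, such as a composition-based SVM model.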
Annotation of proteomes
We annotated five eukaryotic proteomes using the method developed in this study. The proteomes of S. cerevisiae, C. elegans, D. melanogaster, M. musculus and H. sapiens were downloaded from EBI http://www.ebi.ac.uk/integr8 and contain 5780, 22437, 16251, 32895 and 38213 proteins respectively.
Availability and requirements
The NpPred prediction system, which uses SVM and HMM based Pfam search for predicting nuclear proteins, has been implemented at http://www.imtech.res.in/raghava/nppred. To use NpPred, the user needs access to the internet. The server allows the user to predict nuclear proteins from submitted sequences.
The authors are thankful to Drs Alok K. Mondal and M. Michael Gromiha for critically reading the manuscript. This work was supported by grants from the Council of Scientific and Industrial Research (CSIR) and the Department of Biotechnology (DBT), Government of India. Manish Kumar is a senior research fellow of CSIR. This research article has IMTech communication number 041/2007.
1. Guda C, Fahy E, Subramaniam S: MITOPRED: a genome-scale method for prediction of nucleus-encoded mitochondrial proteins. Bioinformatics 2004, 20(11):1785–1794. 10.1093/bioinformatics/bth171
2. Kumar M, Verma R, Raghava GPS: Prediction of mitochondrial proteins using support vector machine and hidden Markov model. J Biol Chem 2006, 281(9):5357–5363. 10.1074/jbc.M511061200
3. Emanuelsson O, Nielsen H, Brunak S, von Heijne G: Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 2000, 300(4):1005–1016. 10.1006/jmbi.2000.3903
4. Cokol M, Nair R, Rost B: Finding nuclear localization signals. EMBO Rep 2000, 1(5):411–415. 10.1093/embo-reports/kvd092
5. Heddad A, Brameier M, MacCallum RM: Evolving regular expression-based sequence classifiers for protein nuclear localization. 2nd European Workshop on Evolutionary Computation and Bioinformatics (EvoBIO 2004): 2004; Coimbra, Portugal 2004, 31–40.
6. Brameier M, Krings A, MacCallum RM: NucPred–predicting nuclear localization of proteins. Bioinformatics 2007, 23(9):1159–1160. 10.1093/bioinformatics/btm066
7. Nair R, Rost B: Mimicking cellular sorting improves prediction of subcellular localization. J Mol Biol 2005, 348(1):85–100. 10.1016/j.jmb.2005.02.025
8. Pierleoni A, Martelli PL, Fariselli P, Casadio R: BaCelLo: a balanced subcellular localization predictor. Bioinformatics 2006, 22(14):e408–416. 10.1093/bioinformatics/btl222
9. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 2000, 16(5):412–424. 10.1093/bioinformatics/16.5.412
10. Bhasin M, Raghava GPS: ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res 2004, (32 Web Server):W414–419. 10.1093/nar/gkh350
11. Garg A, Bhasin M, Raghava GPS: Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. J Biol Chem 2005, 280(15):14427–14432. 10.1074/jbc.M411789200
12. Rashid M, Saha S, Raghava GPS: Support Vector Machine-based method for predicting subcellular localization of mycobacterial proteins using evolutionary information and motifs. BMC Bioinformatics 2007, 8(1):337. 10.1186/1471-2105-8-337
13. Xie D, Li A, Wang M, Fan Z, Feng H: LOCSVMPSI: a web server for subcellular localization of eukaryotic proteins using SVM and profile of PSI-BLAST. Nucleic Acids Res 2005, (33 Web Server):W105–110. 10.1093/nar/gki359
14. Dingwall C, Laskey RA: Protein import into the cell nucleus. Annu Rev Cell Biol 1986, 2: 367–390. 10.1146/annurev.cb.02.110186.002055
15. Chou KC, Zhang CT: Prediction of protein structural classes. Crit Rev Biochem Mol Biol 1995, 30(4):275–349. 10.3109/10409239509083488
16. Bhasin M, Garg A, Raghava GPS: PSLpred: prediction of subcellular localization of bacterial proteins. Bioinformatics 2005, 21(10):2522–2524. 10.1093/bioinformatics/bti309
17. Matthews BW: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 1975, 405(2):442–451.
18. Joachims T: Making large scale SVM learning practical. Cambridge: MIT Press; 1999.
19. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, et al.: Pfam: clans, web tools and services. Nucleic Acids Res 2006, (34 Database):D247–251. 10.1093/nar/gkj149
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.