NLStradamus: a simple Hidden Markov Model for nuclear localization signal prediction
© Nguyen Ba et al; licensee BioMed Central Ltd. 2009
Received: 25 March 2009
Accepted: 29 June 2009
Published: 29 June 2009
Nuclear localization signals (NLSs) are stretches of residues within a protein that are important for the regulated nuclear import of the protein. Of the many import pathways that exist in yeast, the best characterized is termed the 'classical' NLS pathway. The classical NLS contains specific patterns of basic residues and computational methods have been designed to predict the location of these motifs on proteins. The consensus sequences, or patterns, for the other import pathways are less well-understood.
In this paper, we present an analysis of characterized NLSs in yeast, and find, despite the large number of nuclear import pathways, that NLSs seem to show similar patterns of amino acid residues. We test current prediction methods and observe a low true positive rate. We therefore suggest an approach using hidden Markov models (HMMs) to predict novel NLSs in proteins. We show that our method is able to consistently find 37% of the NLSs with a low false positive rate and that our method retains its true positive rate outside of the yeast data set used for the training parameters.
Our implementation of this model, NLStradamus, is made available at: http://www.moseslab.csb.utoronto.ca/NLStradamus/
Eukaryotic cells are defined by the presence of their nucleus. The nuclear membrane enclosing the genetic material of the cell is selective in its import of material through its nuclear pores and this translocation is mediated by cellular mechanisms [1, 2].
Proteins entering the nucleus must do so through proteins forming the nuclear pores: the nuclear pore complex [3, 4]. The pores allow the passive diffusion of small proteins, but bigger proteins entering the nucleus are usually bound by karyopherin complexes on their nuclear localization signal . Although there are many nuclear import pathways in eukaryotic cells, most of these have not been characterized in detail. The best understood is the classical NLS pathway. The recognition of classical NLSs on nuclear proteins is done by the importin-α subunit which in turn is recognized by the importin-β subunit. This trimer (cargo, importin-α and importin-β) is then imported to the nucleus after series of enzymatic steps [1, 6]. Other families of NLSs are independent of importin-α, and may bind directly to one of the members of the importin-β superfamily .
Classical NLSs show characteristic patterns of basic residues loosely matching two consensus sequences, K(K/R)X(K/R) and KRX10–12KRXK, termed the 'monopartite' and 'bipartite' classical NLS [1, 2]. PSORT  accurately predicts protein localization by including heuristic scores based on residues frequencies, protein domains as well as motifs. PSORT will then offer, based on the cNLS consensus sequences, various possible NLSs on nuclear proteins.
However, many known NLSs do not match any of the consensus sequences described above. Therefore, other computational methods have been proposed to predict NLSs based on amino acid sequences. One method, PredictNLS , employs a database of regular expressions to predict the various types of NLSs.
Here we sought to test the power of these methods to predict NLSs in Saccharomyces cerevisiae. We compiled a list of experimentally verified NLSs in yeast and found that PredictNLS  predicts 10% of the NLSs in this set. Based on an analysis of the residues frequencies in the NLSs in this set, we developed a simple hidden Markov model  that can be used to predict the nuclear localization signal in proteins, achieving true positive rate of 37%. We also explored using specific models for the monopartite and bipartite NLSs, but surprisingly, found no significant improvement in positive predictive value. Finally, we also show that our method fares well in species other than yeast, consistent with the conservation of import mechanisms.
A set of experimentally verified nuclear localization signals in yeast
In order to test the predictive power of computational methods to predict NLSs in S. cerevisiae, we sought to identify a set of experimentally confirmed NLSs. We searched the literature for papers in which specific stretches of amino acids have been shown to act as functional NLSs in yeast. While many approaches have been used to characterize and identify nuclear localization signals in proteins, we took as the NLS the minimal sequence in the protein that was proven to be either necessary or sufficient for nuclear import.
To categorize NLSs into their respective import pathways, we looked for evidence of their respective pathways which includes receptor binding or pathway dependency. However, only half of our proteins had a clear reference to these, and many remain unknown.
non-cNLS and cNLS sets
Necessity or Sufficiency (PMID)
Receptor binding or pathway dependency (PMID)
unknown NLS set
Necessity or Sufficiency (PMID)
Current NLS prediction methods show little predictive power on our yeast data
Probabilistic models have been widely used to predict domains in proteins (E.g., Pfam ). Profile HMMs model a linear series of states which approximately corresponds to the pattern of residues in a consensus motif. These models can account for length variation using insertion and deletion states.
Taken together, our results indicate that both frameworks show some positive predictive value but PredictNLS shows a low true positive rate, and both methods show PPV of around 50% in our data.
The nuclear localization signal shows a strong statistical difference in residue frequencies
While the NLSs can be divided in many functional categories depending on their pathways, we observed that most NLSs had an enrichment of basic residues. We first analyzed the residue content of the characterized cNLSs. Consistent with the definition of classical NLSs, we observe significant frequency differences (see Methods) mainly in lysine (7.3% for the genome vs 29.3% for the cNLSs, P-value < 10-10) and arginine (4.4% for the genome vs 15.7% for the cNLSs, P-value < 10-10).
To determine whether there were differences between the different types of NLSs, We then compared the set of cNLS to all other NLS (non-cNLS and unknown) and surprisingly found no significant differences in their residue frequencies. Thinking that maybe the unknown set of NLSs might have been mostly composed of cNLSs, we also compared the cNLS residue frequencies to the non-cNLS residue frequencies and also found no significant differences in residue frequencies. Thus, in our data set the three categories of NLSs (cNLS, non-cNLS and unknown) show no difference in residue frequencies.
Interestingly, we also observe two groups of NLSs on this plot, the first showing an average length of 8 amino acids containing an average of 3.9 basic residues while the other showing an average length of 20 amino acids containing an average of 7.3 basic residues. We suggest that these regions correspond to monopartite and bipartite NLSs . We note that even NLSs known not to be importin-α dependent showed this pattern.
A simple Hidden Markov Model shows better predictive performance
We were surprised to observe that a single, simple model could achieve increased performance even though our training data spanned a large diversity of NLSs, e.g., some cNLSs and others not, both bipartite and monopartite. Nevertheless, we found that our method showed similar positive predictive value on each set of NLSs when we analyzed them individually (data not shown). While perhaps surprising, these results are consistent with the similar residue compositions of each sets of NLSs (Figure 3).
A more complex model does not improve prediction
To test these models, we defined bipartite and monopartite NLSs based on the two classes we had observed in Figure 3. We tested the bipartite model on the bipartite NLSs only. Surprisingly, the positive predictive value and true positive rate of the bipartite model was exactly the same as the combined model (TPR = 61%, same FPR), as the combined model can also identify bipartite NLSs (Figure 6b, d). Similar predictive power was also observed when comparing the monopartite model to the combined model on only monopartite NLSs. However, a small increase in true positive rate was observed for the monopartite model (TPR = 41% vs 38%, same FPR) but this difference was not significant.
Applying the model to other organisms
Seeking to understand the elevated rate of false predictions, we explored them further and found that 7 of the 15 vertebrate false positives occur on only three proteins. When we searched the literature for more information on these proteins we found that one of our 'false' predictions had actually been identified as an NLS [16, 17]. We identified on another protein two patches of amino-acid repeats, and finally on the last protein two nucleolar localization signals, which are long stretches of basic amino acid residues, and it is not yet clear if these signals should be considered as NLSs [18, 19].
We noted in total three false predictions composed of amino-acid repeats, which were largely absent from our yeast training set. Such repeats are problematic because they violate an assumption of our simple hidden-Markov model, namely, that adjacent residues are independent given the value of the hidden state.
Taken together, this analysis of vertebrate NLSs demonstrates that the true positive rate of our method is not confined to our training data set. However, we did note an increase in false positives, which might be explained by additional undiscovered NLSs in this data set, or repetitive sequences in vertebrate proteins.
Analysis of residue frequency clearly shows that there is a bias in key residues in the NLS, and while previous studies have shown that there exists some position requirement in these residues , the NLSs do not clearly align and therefore we believe that the signal simply do not obey a clear consensus sequence rule. Our results are consistent with the model that NLSs may simply be regions of high positive charges with only minor spatial requirements . This idea is consistent with the model that the NLS can be masked by phosphorylation , by inhibiting the activity of the signal due to addition of negative charges.
There is abundant evidence that cNLSs bind at specific positions on importin-α and this binding is mediated by two sites which bind monopartite or bipartite cNLSs [22, 23], which suggests that the bipartite cNLS is not simply a bigger monopartite cNLS. We were, however, unable to exploit this knowledge to create a stronger predictor. While each basic patch in the bipartite cNLS does not represent a monopartite cNLS, the combination of both patches is sufficient for recognition by our predictor.
By examining some false negatives, we find at least two reasons why our method fails to predict them. First, there are examples that do not seem to show an enrichment of basic residues. Second, other NLSs are simply too small to be identified reliably, due to a lack of statistical signal.
We note that unlike PSORT which is based on the cNLS consensus motifs, our method does not attempt to predict NLSs belonging to a particular pathway. We observed that all NLSs show an excess of basic residues, and therefore we have developed a simple HMM that can identify stretches of basic residues in protein sequences. It is somewhat surprising that this method performs reasonably well. This suggests that the consensus patterns for different import pathways may be more similar than currently anticipated.
We believe that obtaining a higher positive predictive value is certainly possible by combining other biological knowledge; however, this is still under study. For example, some of our false predictions might either be real NLSs that have not yet been characterized, or simply part of other undiscovered protein domains.
In conclusion, we offer a simplified approach to predicting nuclear localization signals and show that this method can be applied to multiple species. This agrees with the notion that important regulatory mechanisms are conserved, although it is possible that differences in positive predictive value can be attributed to the existence of multiple importin-α,β homologues in other species  or differences in proteins lengths between species.
Creation of alignments and assessment of consensus sequence based methods
Based on recent studies on the binding sites of the importin-α , we created an alignment of our cNLS based on the major site of the importin (Figure 2). This alignment was run through HMMer  using HMMbuild and calibrated using HMMcalibrate. Using HMMsearch and a leave-one-out cross-validation , we assessed its predictive power. The leave-one-out cross-validation of the HMMer framework was done by removing all the characterized NLSs of one of the proteins from the alignment and then applying the above method on the protein.
Analysis of residue frequency in the nuclear localization signal and statistical significance
where a ij is the number of counts in ith row and jth column of a 2 × 2 table, R1, R2, C1 and C2 are the row and column totals, respectively, and N is the total number of observations. To calculate the two-tailed P-value using Fisher's exact test, we summed the probabilities of all the matrices below and equal to the P cutoff . If the P-value was below than 0.01 we considered that residue significant.
Creation of a Hidden Markov model
We created a simple two state HMM as well as a four state HMM to assess the difference between monopartite and bipartite motifs. The two state HMM receives two sets of states based on the background and NLS residue frequencies, while the four state HMM assumes that the spacer residue frequencies are equal to the 'background' residue frequencies. A four state HMM where the spacer residue frequencies were taken directly from our characterized data was also done (data not shown). Transition probabilities were approximated by summing the number of times that a transition occurs divided by the total amount of transitions from a given state in our data set. The starting state was assumed to be the same as the 'background' state. Further analysis using the Baum-Welch algorithm [26, 27] was used but did not yield significant predictive power difference. Outputs are both the most probable path and posterior probabilities. This HMM is publically accessible from our web page.
Where akl is the transition probability from state k to state l and Akl is the observed frequency of this transition in the characterized data.
The posterior probability is the probability that the ith amino acid in the given sequence (x) is produced by state k (πi = k). This is calculated using the forward algorithm, which is the probability of all the sequences up to and including the ith amino acid requiring that the ith amino acid is produced by state k (fk(i)), multiplied by probability of all sequences after the ith amino acid when the ithamino acid is produced by state k (bk(i)), given by the backward algorithm, and divided by the probability of the whole sequence given the model (P(x)).
Signal detection, optimizing true positive rate and ROC curve
We used several thresholds on the posterior probability and any segment over the threshold is labelled as a signal for the two state model. The four state model instead sums the two NLS states as well as the spacer state and establishes a signal when this sum is over the threshold. Using signal/true motif overlaps, we assess the true positive rate and positive predictive value using a leave-one-out cross-validation . The leave-one-out cross validation was done systematically by removing a protein from the training set, and testing the predictor on this protein.
The rCC is the Pearson product-moment coefficient of correlation, where TP and TN are the number of true positives and true negatives, while FP and FN are the number of false positives and false negatives, taken at the nucleotide level as opposed to the signal level. Because our predictions are done on protein sequences, this was done at the residue level instead of nucleotides as is done for DNA motifs .
A Matthews' correlation coefficient is calculated similarly to the calculation of the rCC, however this is taken at the signal level and taking as the number of true negatives the number of predictions made at a threshold of 0.001. This is a similar method as used by predictNES [14, 15]. This number of true negatives was also used to assess significance of false positive rate differences in a Fisher's test as above.
True positive rate (TPR) is calculated as the number of true predictions divided by the number of total possible true positives, while the false positive rate (FPR) is calculated as the number of false positives divided by the sum of the length of all our characterized proteins. The false positive rate is therefore given at the residue level, while the true positive rate is given at the level of individual NLSs. Positive predictive value (PPV) is assessed as the ratio of true positive predicted NLSs to the total number of predicted NLSs.
The HMM-based prediction method, NLStradamus, for predicting nuclear localization signals can be accessed directly at: http://www.moseslab.csb.utoronto.ca/NLStradamus/. Both two state and four state HMMs are available for use and posterior decoding at different thresholds is also possible.
This work was supported by a research award from the Center for the Analysis of Genome Evolution and Function (CAGEF) to ANNB and by a Canadian Foundation for Innovation infrastructure grant to AMM. We thank members of NP's lab for providing us with valuable inputs during the beginning of the project, as well as providing us the web server, and Hardeep Nahal helped us set up our web application. We acknowledge members of CUBIC involved in the PredictNLS project, who kindly provided us a copy of their standalone program to help us do a prior analysis of the nuclear localization signal.
- Lange A, Mills RE, Lange CJ, Stewart M, Devine SE, Corbett AH: J Biol Chem. 2007, 282(8):5101–5. Epub 2006 Dec 14. Epub 2006 Dec 14. 10.1074/jbc.R600026200PubMed CentralView ArticlePubMedGoogle Scholar
- Poon IK, Jans DA: Regulation of nuclear transport: central role in development and transformation? Traffic 2005, 6(3):173–86. 10.1111/j.1600-0854.2005.00268.xView ArticlePubMedGoogle Scholar
- Rout MP, Blobel G: Isolation of the yeast nuclear pore complex. J Cell Biol 1993, 123(4):771–83. 10.1083/jcb.123.4.771View ArticlePubMedGoogle Scholar
- Panté N, Aebi U: The nuclear pore complex. J Cell Biol 1993, 122(5):977–84. 10.1083/jcb.122.5.977View ArticlePubMedGoogle Scholar
- Görlich D, Mattaj IW: Nucleocytoplasmic transport. Science 1996, 271(5255):1513–8. 10.1126/science.271.5255.1513View ArticlePubMedGoogle Scholar
- Moroianu J, Blobel G, Radu A: Nuclear protein import: Ran-GTP dissociates the karyopherin alphabeta heterodimer by displacing alpha from an overlapping binding site on beta. Proc Natl Acad Sci USA 1996, 93(14):7059–62. 10.1073/pnas.93.14.7059PubMed CentralView ArticlePubMedGoogle Scholar
- Nakai K, Horton P: PSORT: a program for detecting the sorting signals of proteins and predicting their subcellular localization. Trends Biochem Sci 1999, 24(1):34–35. 10.1016/S0968-0004(98)01336-XView ArticlePubMedGoogle Scholar
- Cokol M, Nair R, Rost B: Finding nuclear localization signals. EMBO Rep 2000, 1(5):411–5. 10.1093/embo-reports/kvd092PubMed CentralView ArticlePubMedGoogle Scholar
- Hahn S, Maurer P, Caesar S, Schlenstedt G: Classical NLS proteins from Saccharomyces cerevisiae. J Mol Biol 2008, 379(4):678–94. Epub 2008 Apr 22. Epub 2008 Apr 22. 10.1016/j.jmb.2008.04.038View ArticlePubMedGoogle Scholar
- Finn RD, Tate J, Mistry J, Coggill PC, Sammut JS, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A: The PFAM protein familes database. Nucleic Acids Research 2008, (36 Database):D281-D288.Google Scholar
- Eddy SR: Profile hidden Markov models. Bioinformatics 1998, 14(9):755–63. 10.1093/bioinformatics/14.9.755View ArticlePubMedGoogle Scholar
- Burset M, Guigó R: Evaluation of gene structure prediction programs. Genomics 1996, 34(3):353–67. 10.1006/geno.1996.0298View ArticlePubMedGoogle Scholar
- Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Régnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 2005, 23(1):137–44. 10.1038/nbt1053View ArticlePubMedGoogle Scholar
- Matthews BW: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 1975, 405(2):442–51.View ArticlePubMedGoogle Scholar
- la Cour T, Kiemer L, Mølgaard A, Gupta R, Skriver K, Brunak S: Analysis and prediction of leucine-rich nuclear export signals. Protein Eng Des Sel 2004, 17(6):527–36. Epub 2004 Aug 16. Epub 2004 Aug 16. 10.1093/protein/gzh062View ArticlePubMedGoogle Scholar
- Kaneko H, Orii KO, Matsui E, Shimozawa N, Fukao T, Matsumoto T, Shimamoto A, Furuichi Y, Hayakawa S, Kasahara K, Kondo N: BLM (the causative gene of Bloom syndrome) protein translocation into the nucleus by a nuclear localization signal. Biochem Biophys Res Commun 1997, 240(2):348–53. 10.1006/bbrc.1997.7648View ArticlePubMedGoogle Scholar
- Mirski SEL, Gerlach JH, Cole SPC: Sequence Determinants of Nuclear Localization in the α and β Isoforms of Human Topoisomerase II. Experimental Cell Research 1999, 251(2):329–339. 10.1006/excr.1999.4587View ArticlePubMedGoogle Scholar
- Dokal I: Dyskeratosis congenita in all its forms. Br J Haematol 2000, 110(4):768–79. 10.1046/j.1365-2141.2000.02109.xView ArticlePubMedGoogle Scholar
- Hatanaka M: Discovery of the nucleolar targeting signal. Bioessays 1990, 12(3):143–8. 10.1002/bies.950120310View ArticlePubMedGoogle Scholar
- Colledge WH, Richardson WD, Edge MD, Smith AE: Extensive mutagenesis of the nuclear location signal of simian virus 40 large-T antigen. Mol Cell Biol 1986, 6(11):4136–9.PubMed CentralView ArticlePubMedGoogle Scholar
- Jans DA, Hübner S: Regulation of protein transport to the nucleus: central role of phosphorylation. Physiol Rev 1996, 76(3):651–85.PubMedGoogle Scholar
- Conti E, Uy M, Leighton L, Blobel G, Kuriyan J: Crystallographic analysis of the recognition of a nuclear localization signal by the nuclear import factor karyopherin alpha. Cell 1998, 94(2):193–204. 10.1016/S0092-8674(00)81419-1View ArticlePubMedGoogle Scholar
- Fontes MR, Teh T, Kobe B: Structural basis of recognition of monopartite and bipartite nuclear localization sequences by mammalian importin-alpha. J Mol Biol 2000, 297(5):1183–94. 10.1006/jmbi.2000.3642View ArticlePubMedGoogle Scholar
- Goldfarb DS, Corbett AH, Mason DA, Harreman MT, Adam SA: Importin alpha: a multipurpose nuclear-transport receptor. Trends Cell Biol 2004, 14(9):505–14. 10.1016/j.tcb.2004.07.016View ArticlePubMedGoogle Scholar
- Mosteller F, Tukey JW: Data analysis, including statistics. In Handbook of Social Psychology. 2nd edition. Edited by: Lindzey G, Aronson E. Reading: Addison-Wesley; 1968:80–203.Google Scholar
- Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press; 1998.View ArticleGoogle Scholar
- Baum LE, Petrie T, Soules G, Weiss N: A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. The Annals of Mathematical Statistics 1970, 41(1):164–171. 10.1214/aoms/1177697196View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.