Prediction of nuclear proteins using nuclear translocation signals proposed by probabilistic latent semantic indexing
© Su et al.; licensee BioMed Central Ltd. 2012
Published: 13 December 2012
Identifying the subcellular localization of proteins is crucial for elucidating cellular processes and molecular functions in a cell. However, given the tremendous amount of sequence data generated in the post-genomic era, determining protein localization by biological experiments can be expensive and time-consuming. Therefore, prediction systems that analyze uncharacterised proteins efficiently play an important role in high-throughput protein analyses. In a eukaryotic cell, many essential biological processes take place in the nucleus. Nuclear proteins shuttle between the nucleus and the cytoplasm based on recognition of nuclear translocation signals, including nuclear localization signals (NLSs) and nuclear export signals (NESs). Currently, only a few approaches have been developed specifically to predict nuclear localization using sequence features, such as putative NLSs. However, it has been shown that prediction coverage based on NLSs is very low. In addition, most existing approaches attain a prediction accuracy of only around 54%~70% and a Matthews correlation coefficient (MCC) of only around 0.250~0.380 on an independent test set. Moreover, no predictor can generate sequence motifs to characterize features of potential NESs, whose biological properties are not well understood from existing experimental studies.
In this study, we first propose PSLNuc (Protein Subcellular Localization prediction for Nucleus) for predicting nuclear localization in proteins. For feature representation, a protein is represented by gapped-dipeptides, and the feature values are weighted by homology information from a smoothed position-specific scoring matrix. We then incorporate probabilistic latent semantic indexing (PLSI) for feature reduction. Finally, the reduced features are used as input for a support vector machine (SVM) classifier. In addition to PSLNuc, we further identify gapped-dipeptide signatures for putative NLSs and NESs to develop a second prediction method, PSLNTS (Protein Subcellular Localization prediction using Nuclear Translocation Signals). We apply PLSI to generate gapped-dipeptide signatures from both nuclear and non-nuclear proteins, and propose candidate sequence motifs for putative NLSs and NESs. Then, in PSLNTS, we incorporate only the proposed gapped-dipeptide signatures in an SVM classifier to mimic biological properties of NLSs and NESs for predicting nuclear localization.
Experimental results demonstrate that the proposed method significantly improves nuclear localization prediction. To compare our predictive performance with other approaches, we use two non-redundant benchmark data sets, a training set and an independent test set. Evaluated by five-fold cross-validation on the training set, PSLNuc attains an overall accuracy of 79.7%, a 4.8% improvement over the state-of-the-art system. In addition, our method enhances the MCC from 0.497 to 0.595. On the independent test set, PSLNuc outperforms other predictors by 3.9%~19.9% on accuracy and 0.077~0.207 on MCC. This suggests that, in addition to NLSs, which have been shown to be important for nuclear proteins, NESs can also be an effective indicator to detect non-nuclear proteins. Most notably, using only a few proposed gapped-dipeptide signatures as input features for the SVM classifier, PSLNTS further enhances the accuracy and MCC to 80.9% and 0.618, respectively. Our results demonstrate that gapped-dipeptide signatures can better discriminate nuclear and non-nuclear proteins. Moreover, the proposed gapped-dipeptide signatures can be biologically interpreted and used in further experimental analyses of nuclear translocation signals, including NLSs and NESs.
In eukaryotic cells, many essential biological processes take place in the nucleus. Nuclear localization is a complicated set of processes that play a crucial role in the dynamic self-regulation of the cell. To participate in cell regulation processes, proteins are translocated in and out of the nucleus. This import and export are mediated by short binding sites on the protein sequence, called nuclear localization signals (NLSs) and nuclear export signals (NESs). Both NLSs and NESs have been used as important features to detect nuclear proteins. However, given the tremendous amount of protein sequences generated in the post-genomic era, NLSs and NESs are not yet well understood from existing experiments, and so the set of currently known NLSs and NESs may be incomplete. Therefore, developing computational methods to identify potential NLSs and NESs has become highly desirable for predicting nuclear localization.
At present, only a few predictors are designed specifically to identify proteins imported into the nucleus. PredictNLS predicts nuclear proteins based on the presence of known or putative NLSs derived from the contents of NLSdb. NucPred uses regular expression matching and multiple program classifiers induced by genetic programming to detect putative NLSs. NUCLEO incorporates sequence motifs from known NLSs in a support vector machine (SVM) classifier for predicting nuclear localization. NpPred applies SVM classifiers and hidden Markov models (HMMs) using k-peptide composition and achieves high accuracy on a data set in which proteins are filtered at 90% sequence identity. Although general localization prediction methods provide comprehensive information, they do not consider compartment-specific features to optimize for a particular localization site. Besides the above predictors designed to predict nuclear proteins, several methods, such as NLStradamus, NetNES, and NoD, have been proposed to detect NLSs and NESs. NLStradamus uses HMMs to predict NLSs in proteins, and NetNES predicts NESs using neural networks and HMMs. In addition, NoD applies an artificial neural network algorithm to detect nucleolar localization sequences in eukaryotic and viral proteins. Moreover, several methods [10–13] have been developed to further classify nuclear proteins according to their subnuclear localizations. In this study, we propose a method to improve nuclear localization prediction based on potential NLSs and NESs generated from our analysis of gapped-dipeptide signatures.
Prediction of nuclear proteins presents several challenges. First, methods that integrate biological features only from known or putative NLSs could suffer from low coverage in high-throughput proteomic analyses, because they lack information to characterize NESs from nuclear exported proteins. Second, several predictors are trained on redundant training sets, which might lead to overestimation of the predictive performance. The performance would be significantly lower if redundant sequences were meticulously removed (e.g., at 25% sequence identity or even less). Meanwhile, the performance of amino acid composition-based and sequence homology-based methods might be significantly degraded if homologous sequences are not detected. In addition, the k-peptide feature representation used by amino acid composition-based methods can result in a very large feature dimension during the machine learning procedure, so an effective feature reduction is highly desirable. Finally, results of these two types of methods are generally difficult to interpret; therefore, it is difficult to determine which biological features should be used to identify nuclear or non-nuclear proteins and why they work well for prediction. If the features were biologically interpretable, the resultant knowledge could help in designing artificial proteins with the desired properties.
In this study, we first present a method, PSLNuc (Protein Subcellular Localization prediction for Nucleus), for predicting nuclear localization in proteins. For feature representation, sequence homology information from a smoothed position-specific scoring matrix (PSSM) is incorporated to calculate the weights of gapped-dipeptides. After that, probabilistic latent semantic indexing (PLSI) is used for feature reduction. Finally, the reduced features are applied as input vectors for an SVM classifier. In addition to PSLNuc, we further generate gapped-dipeptide signatures for potential NLSs and NESs, and develop another prediction method, PSLNTS (Protein Subcellular Localization prediction using Nuclear Translocation Signals). To propose candidate sequence patterns of putative NLSs and NESs, we apply PLSI to generate gapped-dipeptide signatures from both nuclear and non-nuclear proteins. Then, we further incorporate only the proposed gapped-dipeptide signatures in an SVM classifier to mimic biological properties of NLSs and NESs in PSLNTS.
Experimental results show that PSLNuc achieves high prediction accuracy, which demonstrates that homology information of gapped-dipeptides reduced by PLSI can significantly enhance the performance. Our analysis suggests that, in addition to NLSs, which have been shown to be important for nuclear proteins, NESs could also be an effective indicator to detect non-nuclear proteins. Most notably, the overall accuracy of PSLNTS is further improved to 0.809 using only the proposed gapped-dipeptide signatures. This implies that gapped-dipeptide signatures can better discriminate nuclear and non-nuclear localization. In addition, since sequence redundancy tends to inflate the estimated predictive performance, we use non-redundant data sets and show that the general accuracy of nuclear prediction should be approximately 0.800. Finally, since the proposed gapped-dipeptide signatures are biologically interpretable, they can be easily applied to advanced analyses and experimental designs of nuclear translocation signals.
[Tables and figures not reproduced. Recoverable titles: Training Data Set (by five-fold cross-validation); Independent Test Data Set; Gapped-dipeptide signatures for nuclear proteins; Gapped-dipeptide signatures for non-nuclear proteins; Performance comparison using gapped-dipeptide signatures, Training Set (by five-fold cross-validation).]
In this study, we first incorporate gapped-dipeptides weighted by a smoothPSSM encoding and reduced by PLSI to predict nuclear localization in PSLNuc. Our results show that PSLNuc significantly improves the predictive performance compared to the state-of-the-art system. Experimental results also suggest that, in addition to NLSs, which have been shown to be important for nuclear proteins, NESs can also be an effective indicator to detect non-nuclear proteins. Second, we apply only a few proposed gapped-dipeptide signatures in PSLNTS and further enhance the accuracy and MCC to 0.809 and 0.618, respectively. This demonstrates that gapped-dipeptide signatures can better discriminate nuclear and non-nuclear localization. Most notably, the proposed gapped-dipeptide signatures could be biologically interpreted and used in further experimental studies of nuclear translocation signals, including NLSs and NESs.
In this study, we extend our previously proposed method for general localization site prediction and formulate nuclear protein prediction as a document classification problem. In our previous work, we incorporated gapped-dipeptides as the feature representation and PLSI as the feature reduction method for predicting protein localization sites. The method was inspired by word representation and word vector reduction in document classification research, where a document is assigned to one or several categories according to its content. Similarly, prediction of nuclear localization can be formulated as a document classification problem, in which a protein sequence is regarded as a document's content and its subcellular localization classes are treated as categories of the document. Classification of documents is often solved in the following steps. First, for feature representation, each document is represented by a feature vector, where each word denotes a feature and its feature value represents the weight of the word in the document. Second, because the feature space of words in a document is high-dimensional, features are further reduced to enhance prediction accuracy and prevent overestimation. Finally, the reduced features are used as input vectors in machine learning approaches for predicting document categories. To calculate the weight of each word, a standard PSSM encoding was incorporated to predict general protein localization in our previous study. However, it has been shown that a smoothPSSM encoding scheme is more effective for predicting protein structure and function. In this study, we incorporate a new smoothPSSM encoding scheme and extend our approach to predict nuclear localization for bioinformatics analysis. The task is a large-scale analysis of nuclear proteins, including prediction of nuclear localization and analyses of nuclear translocation signals.
To solve these problems, we propose a prediction method in which proteins are represented by gapped-dipeptides weighted by a smoothPSSM, and PLSI is incorporated for feature reduction. The feature representation, feature weighting, feature reduction, system architecture, and evaluation measures are described in the following sections.
When proteins are considered as documents, several types of word representation have been used, such as amino acid compositions and k-peptide compositions. Specifically, a dipeptide (k = 2) composition can be considered as a bi-gram word representation. However, peptides with gaps cannot be represented by a k-peptide composition. In addition, k-peptide compositions generate high-dimensional feature vectors when they are used to represent remote sequence information. For instance, the dimension of a feature vector reaches 8,000 (= 20 × 20 × 20) if a tri-peptide composition is considered. To distinguish nuclear and non-nuclear proteins, we incorporate the gapped-dipeptide representation used in our previous work to represent sequence features in proteins. A gapped-dipeptide AdB represents the sequence pattern in which two amino acids A and B are separated by d residues. When we consider gapped-dipeptides with gap distances d from 0 up to u, the feature dimension of gapped-dipeptides in a protein is the total number of possible combinations, that is, 20 × 20 × (u+1). For instance, when u = 15, a feature vector of 6,400 (= 20 × 20 × 16) dimensions is used to represent a protein sequence. Due to computational and time complexity, tri-peptides and other k-peptides are not considered in this study.
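As a minimal sketch of this representation, the following Python enumerates the gapped-dipeptide feature space and counts raw occurrences in a sequence. The function names and the string form of each feature (for example "M2I") are illustrative assumptions, not taken from the authors' implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def gapped_dipeptide_features(max_gap):
    """List every gapped-dipeptide AdB for gap distances d = 0..max_gap."""
    return [f"{a}{d}{b}"
            for d in range(max_gap + 1)
            for a, b in product(AMINO_ACIDS, repeat=2)]


def count_gapped_dipeptides(sequence, max_gap):
    """Count raw (unweighted) occurrences of each gapped-dipeptide."""
    counts = {}
    for d in range(max_gap + 1):
        # Residue i pairs with residue i + d + 1, i.e. d residues lie between.
        for i in range(len(sequence) - d - 1):
            key = f"{sequence[i]}{d}{sequence[i + d + 1]}"
            counts[key] = counts.get(key, 0) + 1
    return counts


features = gapped_dipeptide_features(15)   # 16 gap distances, d = 0..15
print(len(features))                        # 20 * 20 * 16 features
print(count_gapped_dipeptides("MKAIM", 2))
```

With 16 gap distances the list has 6,400 entries, which matches the 20 × 20 × 16 dimension quoted in the text.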
where x is a log-likelihood in a smoothPSSM.
in which sf(i, A) represents the normalized log-likelihood in the smoothPSSM element of the ith row and the Ath column. For the sample sequence, the weight of M2I using a smoothPSSM profile is calculated as sf(1, M) × sf(4, I) + sf(2, M) × sf(5, I) + ... + sf(7, M) × sf(10, I). A protein is denoted as a feature vector consisting of gapped-dipeptides, where each gapped-dipeptide is weighted by the TFsmoothPSSM encoding scheme. Finally, the feature vector is normalized to a range of 0 to 1.
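The M2I example above implies that the weight of a gapped-dipeptide AdB is the sum of sf(i, A) × sf(i + d + 1, B) over every position i for which both residues lie inside the sequence. A minimal Python sketch of that sum follows; the profile layout (one dictionary per position) and the function name are assumptions made for illustration, and the final 0-to-1 normalization of the feature vector is not shown.

```python
def gapped_dipeptide_weight(sf, first, second, gap):
    """Sum sf(i, first) * sf(i + gap + 1, second) over all valid positions i.

    `sf` is a list with one dict per residue position, mapping an amino acid
    letter to its normalized smoothPSSM value at that position (hypothetical
    layout chosen for this sketch).
    """
    total = 0.0
    for i in range(len(sf) - gap - 1):
        total += sf[i][first] * sf[i + gap + 1][second]
    return total


# Toy 10-position profile in which every normalized value is 0.5.
profile = [{"M": 0.5, "I": 0.5} for _ in range(10)]

# For M2I on a length-10 profile the sum has 7 terms, as in the text:
# sf(1, M) * sf(4, I) + ... + sf(7, M) * sf(10, I).
print(gapped_dipeptide_weight(profile, "M", "I", 2))
```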
in which P(w|t) represents the conditional probability of a word w conditioned on a topic t, and P(t|d) represents the weighting of a topic variable t in a document d. It is assumed that the word distribution given a topic class is conditionally independent of the document d, i.e., P(w|t, d) = P(w|t). Therefore, the original feature dimension |W| of the word vector is greatly reduced to the number of latent topic variables |T|.
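The equation this sentence refers to is not reproduced in the text. Under the conditional-independence assumption stated above, the standard PLSI decomposition, written with the same symbols, takes the following form (T denotes the set of latent topics):

```latex
P(w \mid d) = \sum_{t \in T} P(w \mid t)\, P(t \mid d)
```

Each document d is thus described by the |T| values P(t|d) instead of |W| word weights.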
We incorporate the same procedure as described in our previous work to train and test the PLSI model. First, for PLSI training, the parameters P(w|t) and P(t|d) are fitted by an iterative expectation-maximization algorithm, in which P(t|d) is estimated in the expectation (E) step and P(w|t) is recalculated in the maximization (M) step. Then, after training, the calculated probability of a word conditioned on a topic, P(w|t), is used to estimate P(t|d') for new documents d' through a folding-in process in PLSI testing.
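The training and folding-in steps can be illustrated with a small pure-Python sketch. It follows the standard PLSI formulation, in which the E-step computes the posterior P(t|d, w) and the M-step re-estimates both P(w|t) and P(t|d) from the expected counts; this may differ in detail from the procedure used in the study. The function names, the random initialization, and the fixed iteration count are assumptions.

```python
import random


def _normalize(values):
    """Scale non-negative values to sum to 1 (uniform if all are zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def train_plsi(counts, n_topics, n_iter=50, seed=0):
    """Fit P(w|t) and P(t|d) by EM; counts[d][w] is a non-negative weight."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_t = [_normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_t_d = [_normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    for _ in range(n_iter):
        new_w_t = [[0.0] * n_words for _ in range(n_topics)]
        new_t_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(t|d,w) is proportional to P(w|t) P(t|d).
                post = _normalize([p_w_t[t][w] * p_t_d[d][t]
                                   for t in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for t in range(n_topics):
                    expected = counts[d][w] * post[t]
                    new_w_t[t][w] += expected
                    new_t_d[d][t] += expected
        # M-step: renormalize expected counts into probabilities.
        p_w_t = [_normalize(row) for row in new_w_t]
        p_t_d = [_normalize(row) for row in new_t_d]
    return p_w_t, p_t_d


def fold_in(doc_counts, p_w_t, n_iter=50):
    """Estimate P(t|d') for a new document while keeping P(w|t) fixed."""
    n_topics = len(p_w_t)
    p_t = [1.0 / n_topics] * n_topics
    for _ in range(n_iter):
        new_p_t = [0.0] * n_topics
        for w, c in enumerate(doc_counts):
            if c == 0:
                continue
            post = _normalize([p_w_t[t][w] * p_t[t] for t in range(n_topics)])
            for t in range(n_topics):
                new_p_t[t] += c * post[t]
        p_t = _normalize(new_p_t)
    return p_t


toy = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 5]]
p_w_t, p_t_d = train_plsi(toy, n_topics=2)
print(fold_in([3, 3, 0, 0], p_w_t))  # topic weights for an unseen document
```

In the setting described here, each row of `counts` would hold the weighted gapped-dipeptide vector of one protein, and the returned P(t|d) rows would be the reduced features passed to the classifier.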
PLSI not only reduces the feature dimension but also extracts semantic relationships among gapped-dipeptides. During the PLSI feature reduction process, gapped-dipeptides with similar meanings or preferences are grouped together in a semantic topic, and the preference of each topic for nuclear or non-nuclear localization can then be identified. If an appropriate topic size is selected in feature reduction, mapping feature vectors from the gapped-dipeptide space to the latent semantic topic space can greatly increase learning efficiency and performance. One way to approximate the reduced feature size is based on latent semantic indexing (LSI). First, the singular values of LSI are calculated and sorted in decreasing order. After that, we select t as the reduced feature size of LSI if the t-th largest singular value is close to zero. Although PLSI is not identical to LSI, the number of PLSI reduced dimensions can be reasonably estimated by the number of singular values larger than zero. We set the number of topics to 80 according to our previous study.
1. Perform PSI-BLAST to calculate a standard PSSM for the protein.
2. Construct a smoothPSSM profile based on the standard PSSM.
3. Use gapped-dipeptides to represent the protein and incorporate the TFsmoothPSSM encoding scheme to determine the weights in a feature vector.
4. Apply PLSI to reduce the feature vector.
5. Use the reduced feature vector as input and run the one-versus-one (nuclear and non-nuclear) SVM classifier.
For localization-topic preference identification, we divide the training data set into nuclear and non-nuclear proteins to examine preferred topics. The localization-topic preference of a topic for a localization site is computed as the average of the topic's weights over the proteins in that site. A topic is identified as showing preference for a localization site if its localization-topic preference for that site is larger than for the other site. For nuclear localization prediction, we select the top 10 preferred topics for nuclear proteins and for non-nuclear proteins, respectively, according to localization-topic preference. To list gapped-dipeptides of interest, for each topic, up to 20 (depending on the number of gapped-dipeptides in the topic) most frequent gapped-dipeptides are selected. After that, we incorporate only the proposed gapped-dipeptide signatures for nuclear and non-nuclear proteins to predict nuclear localization. We apply the proposed gapped-dipeptide signatures to capture biological properties and mimic the mechanisms of nuclear translocation.
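The selection procedure can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions rather than statements from the text: preferred topics are ranked by their preference value for the site, and the "most frequent" gapped-dipeptides of a topic are taken to be those with the highest P(w|t).

```python
def localization_topic_preference(topic_weights, labels, site):
    """Average the topic weights P(t|d) over proteins annotated with `site`."""
    rows = [tw for tw, lab in zip(topic_weights, labels) if lab == site]
    n_topics = len(topic_weights[0])
    return [sum(r[t] for r in rows) / len(rows) for t in range(n_topics)]


def preferred_topics(topic_weights, labels, site, other, top_k=10):
    """Topics whose preference for `site` exceeds that for `other`, best first."""
    pref_site = localization_topic_preference(topic_weights, labels, site)
    pref_other = localization_topic_preference(topic_weights, labels, other)
    candidates = [t for t in range(len(pref_site))
                  if pref_site[t] > pref_other[t]]
    candidates.sort(key=lambda t: pref_site[t], reverse=True)
    return candidates[:top_k]


def topic_signatures(p_w_t_row, vocabulary, top_n=20):
    """Up to top_n gapped-dipeptides with the highest non-zero P(w|t)."""
    ranked = sorted(range(len(vocabulary)),
                    key=lambda w: p_w_t_row[w], reverse=True)
    return [vocabulary[w] for w in ranked[:top_n] if p_w_t_row[w] > 0]


# Toy data: four proteins, three topics.
weights = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.2, 0.1, 0.7]]
labels = ["nuclear", "nuclear", "non-nuclear", "non-nuclear"]
print(preferred_topics(weights, labels, "nuclear", "non-nuclear"))
print(preferred_topics(weights, labels, "non-nuclear", "nuclear"))
print(topic_signatures([0.5, 0.0, 0.3, 0.2], ["K0R", "L3L", "R1K", "P2K"], 2))
```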
where TP, TN, FP, FN, and N represent the number of true positives, true negatives, false positives, false negatives, and total number of protein sequences, respectively. For an objective comparison with other approaches that use five-fold cross-validation, we also apply five-fold cross-validation to evaluate our predictive performance.
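The equations for the evaluation measures are not reproduced in the text. Assuming the standard definitions, accuracy is (TP + TN) / N and the MCC is (TP × TN − FP × FN) divided by the square root of (TP + FP)(TP + FN)(TN + FP)(TN + FN). A minimal Python sketch with made-up counts:

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified proteins: (TP + TN) / N."""
    n = tp + tn + fp + fn
    return (tp + tn) / n


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical confusion-matrix counts, not taken from the paper.
print(accuracy(40, 40, 10, 10))
print(mcc(40, 40, 10, 10))
```

Returning 0.0 when a row or column of the confusion matrix is empty is a common convention and is a choice made for this sketch.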
We thank Hua-Sheng Chiu and Allan Lo for helpful suggestions and computational assistance. The research was supported in part by Taipei Medical University under grant TMU98-AE1-B05 and National Science Council under grants NSC99-2218-E-038-002 and NSC100-2221-E-038-012 to Emily Chia-Yu Su. JMC is funded by a "la Caixa" pre-doctoral fellowship and the Centre de Regulacio Genomica (CRG), the Plan Nacional (BFU2008-00419) from the Spanish Ministry of Science.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 17, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S17.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.