An SVM-based system for predicting protein subnuclear localizations

Background: The large gap between the number of protein sequences in databases and the number of functionally characterized proteins calls for the development of a fast computational tool for the prediction of subnuclear and subcellular localizations generally applicable to protein sequences. The information on localization may reveal the molecular function of novel proteins, in addition to providing insight on the biological pathways in which they function. The bulk of past work has been focused on protein subcellular localizations. Furthermore, no specific tool has been dedicated to prediction at the subnuclear level, despite its high importance. In order to design a suitable predictive system, the extraction of subtle sequence signals that can discriminate among proteins with different subnuclear localizations is the key.
Results: New kernel functions used in a support vector machine (SVM) learning model are introduced for the measurement of sequence similarity. The k-peptide vectors are first mapped by a matrix of high-scored pairs of k-peptides which are measured by BLOSUM62 scores. The kernels, measuring the similarity for sequences, are then defined on the mapped vectors. By combining these new encoding methods, a multi-class classification system for the prediction of protein subnuclear localizations is established for the first time. The performance of the system is evaluated with a set of proteins collected in the Nuclear Protein Database (NPD). The overall accuracy of prediction for 6 localizations is about 50% (vs. random prediction 16.7%) for single localization proteins in the leave-one-out cross-validation; and 65% for an independent set of multi-localization proteins. This integrated system can be accessed at .
Conclusion: The integrated system benefits from the combination of predictions from several SVMs based on selected encoding methods. Finally, the predictive power of the system is expected to improve as more proteins with known subnuclear localizations become available.


Background
The cell nucleus is a highly complex organelle that organizes the comprehensive assembly of our genes and their corresponding regulatory factors. Accordingly, the cell nucleus reflects the intricate regulation of various biological activities. Although protein complexes disperse throughout the entire organelle, it is known that many nuclear proteins participating in related pathways tend to concentrate into specific areas [1,2]. For example, rDNA processing and ribosome biogenesis often occur within the nucleolus, and the proteins responsible for presplicing appear to concentrate into multiple nuclear speckles, even while they are migrating in the nucleus. The confinement of biomolecules within specific compartments is crucial for the formation and function of the cell nucleus; in contrast, the mis-localization of proteins can lead to both human genetic disease and cancer [3].
Accordingly, information on protein subnuclear localization is essential for a full understanding of genomic regulation and function. Advances in experimental technology have enabled the large-scale identification of nuclear proteins. However, at the same time, the sequencing of both the human and mouse genomes has generated an enormous inventory of primary sequences with unknown functions. A faster and cheaper bioinformatics tool is required for the annotation of these exponentially accumulating sequences. A computational prediction of protein subnuclear compartments from primary protein sequences can provide important clues to the function of novel proteins.
The prediction of protein localization at the subnuclear level is challenging compared with that at the subcellular level. Three facts contribute to the difficulty: (1) proteins within the cell nucleus face no apparent physical barrier like a membrane [24]; (2) the nucleus is far more compact and complicated in comparison with other compartments in a cell [25]; and (3) protein complexes within the cell nucleus are not static [1,24,25]. Recent developments in live-cell imaging have revealed that nuclear processes may rely on a constant flow of molecules between dynamic compartments created by relatively immobile binding or assembly sites. As proteins diffuse through the nuclear space, they appear to alter their compartments during different phases of the cell cycle or accompanying differentiation [3]. For instance, some nucleolar proteins are continually exchanging between the nucleoplasm and the nucleolus. Proteomic studies have also highlighted the dynamic nature of the nucleolar proteome [3].
Using the Nuclear Protein Database (NPD) developed by Dellaire, Farrall and Bickmore [26], Bickmore and Sutherland [27] recently addressed the characteristics of the primary sequences of nuclear proteins, such as the molecular weight, isoelectric point, and amino acid composition for proteins in different subnuclear compartments. They also found that motifs and domains are often shared by proteins co-localized within the same subnuclear compartment. Furthermore, certain generally abundant motifs/domains are lacking from the proteins concentrated in some specific areas of the nucleus. Based on these findings, it should be possible to combine the totality of this information in a manner that will enhance the prediction of compartment-specific nuclear localizations of the protein constituents listed in genome databases. Encouraged by our previous success in the design of a metric for the biological similarity of protein sequences [22,23], we developed a prediction system based on support vector machines (SVMs), one of the most advanced machine learning methods [28,29]. The principal feature of our mode of analysis is the introduction of new kernel functions which are effective in capturing the subtle differences between sequences originating from two distinct nuclear compartments.

Results and Discussion
Normally, conventional k-peptide encoding vectors (k = 1, 2, 3) are used for the description of a protein sequence. Successful applications include (1) protein fold recognition [30,31], and (2) the prediction of subcellular localization [5,7,16]. The basic concept of the new kernels proposed in our previous work [22,23] is the measurement of biological similarity for k-peptides that share few or no residues, with the incorporation of evolutionary information. Our findings indicate that the mapping of conventional k-peptide encoding vectors by a matrix formed with high-scored pairs of k-peptides can facilitate the construction of a suitable metric. The score of a pair of k-peptides is calculated from the BLOSUM scores of the residues and, therefore, the evolutionary information of the residues is embedded into the sequence description. A related concept that links two k-peptides with a small number of mutated residues has been presented by Leslie et al. [32] for protein homology detection.
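The conventional k-peptide encoding mentioned above can be sketched as follows. This is a minimal illustration, not the authors' script: the symbol order in `ALPHABET` (alphabetical one-letter codes followed by "X"), the use of raw counts as feature values, and the helper names are all assumptions made for the example.

```python
from collections import Counter

# Assumed symbol order: alphabetical one-letter codes plus the special "X".
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
RANK = {aa: i for i, aa in enumerate(ALPHABET)}


def kpeptide_index(peptide: str) -> int:
    """1-based position of a k-peptide in base-21 order (AAA -> 1, AAC -> 2)."""
    idx = 0
    for aa in peptide:
        idx = idx * len(ALPHABET) + RANK[aa]
    return idx + 1


def kpeptide_counts(seq: str, k: int) -> dict:
    """Sparse k-peptide composition: index -> number of occurrences."""
    counts = Counter(kpeptide_index(seq[i:i + k]) for i in range(len(seq) - k + 1))
    return dict(sorted(counts.items()))


def to_svmlight(label: int, features: dict) -> str:
    """Format one example as an SVM-Light style line: '<label> index:value ...'."""
    return f"{label:+d} " + " ".join(f"{i}:{v}" for i, v in features.items())


# Two short illustrative sequences encoded with tri-peptides (k = 3).
x1 = kpeptide_counts("AAACY", 3)
x2 = kpeptide_counts("AACCY", 3)
line1 = to_svmlight(+1, x1)
line2 = to_svmlight(-1, x2)
shared_indices = sorted(set(x1) & set(x2))
```

With this assumed ordering the two sequences overlap only at the index of the tri-peptide AAC, which shows how sparse the overlap between conventional encodings of similar sequences can be.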
This study presents the performance of conventional k-peptide encoding methods and the new proposed kernels for the prediction of protein subnuclear compartments. Furthermore, with the use of the jury voting scheme developed in [31], an integrated system was built by combining binary prediction outcomes obtained from different sequence encoding schemes. The results demonstrate that the integrated system enhances the overall performance of the system.
The dataset used in this study was extracted from the Nuclear Protein Database (NPD) [26] using a Perl script. The NPD is a curated database that stores information on more than 1000 vertebrate proteins, chiefly from human and mouse, which are reported in the literature to be localized in the cell nucleus. Since certain proteins associate with more than one compartment, a test dataset consisting of proteins with multiple localizations was extracted first. These proteins have the same SwissProt ID or Entrez Protein ID although they are localized in different compartments. This preparative procedure resulted in 92 proteins that are localized within the six compartments described below. The majority are localized in 2 compartments and the remainder are localized in 3 or 4 compartments.
After excluding the multi-localization proteins, a non-redundant dataset was further constructed by PROSET [33] to ensure low sequence identity (<50%). In order to have a sufficient number of proteins for training and testing, only six localizations were selected for evaluation. These are PML BODY (38), Nuclear Lamina (55), Nuclear Splicing Speckles (56), Chromatin (61), Nucleoplasm (75), and Nucleolus (219). Each of these proteins has a single localization and the total number is 504.
It should be noted that the multi-localization proteins are not included in the set of 504 single-localization proteins for the leave-one-out cross-validation (LOOCV). Therefore, the multi-localization dataset is essentially an independent testing set. The summary of the datasets is presented in Table 1.
The evaluations of the predictive power of the methods were performed on the datasets. Since there are 6 localizations in the dataset, the one-versus-one multi-class classification system led to 6*(6-1)/2 = 15 SVM models for one single encoding method (see Methods for details). Three encoding techniques corresponding to the conventional k-peptide composition and three encoding methods based on the new kernels were used for k = 1,2,3. SVM-Light [34] was used as the SVM solver.
The overall accuracy for the multi-class classification proposed by Rost and Sander [35] was used for the evaluation of our system. Suppose there are $m = m_1 + m_2 + \dots + m_N$ test proteins, where $m_i$ is the number of proteins belonging to class $i$ ($i = 1, \dots, N$). Suppose further that, out of the proteins considered, $p_i$ proteins are correctly predicted to belong to class $i$. Then $p = p_1 + p_2 + \dots + p_N$ is the total number of correctly predicted proteins. The accuracy for class $i$ is
$$\mathrm{acc}_i = \frac{p_i}{m_i},$$
and the overall accuracy, denoted by $Q_{acc}$, is defined as
$$Q_{acc} = \frac{p}{m}.$$
Note that $\mathrm{acc}_i$ and $Q_{acc}$ correspond, respectively, to the per-class accuracy and $Q_{total}$ defined in Rost and Sander [35]. Since the numbers of proteins for the various localizations are unbalanced, the Matthews correlation coefficient (MCC) was also employed for the optimization of parameters and the evaluation of performance [36]:
$$\mathrm{MCC}_i = \frac{p_i s_i - u_i o_i}{\sqrt{(p_i + u_i)(p_i + o_i)(s_i + u_i)(s_i + o_i)}},$$
where $p_i$ is the number of correctly predicted proteins of the location $i$, $s_i$ is the number of correctly predicted proteins not in the location $i$, $u_i$ is the number of under-predicted proteins, and $o_i$ is the number of over-predicted proteins.
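The measures above can be computed directly from lists of true and predicted labels. The sketch below is only illustrative: the function names and the toy labels are hypothetical, and returning 0.0 when the MCC denominator vanishes is a convention chosen for the example rather than something stated in the text.

```python
import math


def per_class_counts(y_true, y_pred, cls):
    """Return (p, s, u, o) for one class: correct predictions of the class,
    correct predictions outside the class, under- and over-predictions."""
    pairs = list(zip(y_true, y_pred))
    p = sum(t == cls and q == cls for t, q in pairs)
    s = sum(t != cls and q != cls for t, q in pairs)
    u = sum(t == cls and q != cls for t, q in pairs)
    o = sum(t != cls and q == cls for t, q in pairs)
    return p, s, u, o


def mcc(p, s, u, o):
    """Matthews correlation coefficient for one class."""
    denom = math.sqrt((p + u) * (p + o) * (s + u) * (s + o))
    return (p * s - u * o) / denom if denom else 0.0  # assumed convention


def evaluate(y_true, y_pred):
    """Overall accuracy Q_acc plus per-class accuracy and MCC."""
    report = {}
    for cls in sorted(set(y_true)):
        p, s, u, o = per_class_counts(y_true, y_pred, cls)
        report[cls] = {"acc": p / (p + u), "mcc": mcc(p, s, u, o)}
    q_acc = sum(t == q for t, q in zip(y_true, y_pred)) / len(y_true)
    return q_acc, report


# Toy labels, not data from the study.
y_true = ["Nucleolus"] * 4 + ["Chromatin"] * 2
y_pred = ["Nucleolus", "Nucleolus", "Nucleolus", "Chromatin", "Chromatin", "Nucleolus"]
q_acc, report = evaluate(y_true, y_pred)
```

In this toy case four of six proteins are predicted correctly, and both classes have one under-prediction and one over-prediction, so their MCC values coincide even though their per-class accuracies differ.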
In order to evaluate the performance of the system for multi-localization proteins, the criterion proposed in Gardy et al. [17] was used. More specifically, for a protein with multiple localizations, if the system correctly predicts one of the locations, then the entire prediction is considered correct. It should be noted that this criterion overestimates the performance. Since our method can only predict one localization for a given protein, other evaluation methods for multi-localization proteins, such as the one proposed by Chou and Cai [14,18], cannot be applied.
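The "one correct location suffices" criterion described above amounts to a set-membership check. The following sketch uses hypothetical protein identifiers and annotations purely for illustration.

```python
def multi_localization_accuracy(predicted, annotated):
    """Fraction of proteins whose single predicted compartment is among
    the compartments annotated for that protein."""
    hits = sum(predicted[pid] in annotated[pid] for pid in predicted)
    return hits / len(predicted)


# Hypothetical annotations and predictions (not taken from the NPD).
annotated = {
    "protein_1": {"Nucleolus", "Nucleoplasm"},
    "protein_2": {"Chromatin", "PML BODY"},
    "protein_3": {"Nuclear Lamina", "Nucleolus", "Chromatin"},
}
predicted = {
    "protein_1": "Nucleolus",
    "protein_2": "Nucleolus",
    "protein_3": "Chromatin",
}
accuracy = multi_localization_accuracy(predicted, annotated)
```

Because a protein with several annotated compartments offers several ways to be counted as correct, this measure is more lenient than the single-localization accuracy, which is the overestimation noted in the text.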
The performances for each encoding method and the combined encoding methods are shown in Table 2 and Table 3, respectively. The results for the single-localization proteins were obtained from the LOOCV procedure, and the results for the multi-localization proteins were obtained from the final prediction system. Overall, the single encoding methods gave an accuracy of prediction $Q_{acc}$ that ranged from 47.8% to 51.4% for single-localization proteins and from 57.6% to 64.1% for multi-localization proteins. The corresponding average MCCs ranged from 0.203 to 0.276 for single-localization proteins and from 0.182 to 0.401 for multi-localization proteins. The combination of the new encoding methods $D_1X_1$, $D_2X_2$, and $D_3X_3$ with the use of jury voting yielded an improved performance for MCC. For example, the average MCC was elevated from 0.266-0.276 to 0.284 for single-localization proteins and from 0.362-0.401 to 0.420 for multi-localization proteins. The change in $Q_{acc}$ was not uniform: it decreased from the highest value 51.4% to 50.0% for single-localization proteins and increased from 64.1% to 65.2% for multi-localization proteins. The combination of the conventional k-peptide compositions AA, DI, and TRI did not demonstrate significant improvement. Further optimization of the parameter that determines the sparsity of matrix $D_3$ is likely to enhance the performance of the prediction system.
The final models for the prediction system are the combination of the new encoding methods $D_1X_1$, $D_2X_2$, and $D_3X_3$, since adding any conventional k-peptide encoding method does not improve the performance of the system. The predictions for all the 92 multi-localization testing proteins are detailed in Table S1 in the supplementary file [see Additional file 1].

Conclusion
An SVM-based multi-class classification system has been developed for the prediction of protein subnuclear localizations. This is the first system designed specifically for this task. This system, which integrates predictions from three new encoding methods, achieves encouraging levels of accuracy for six specific subnuclear localizations. However, compared to the prediction of protein localizations at the subcellular level, the corresponding prediction at the subnuclear level is far more challenging. This difficulty arises mainly from the biological fact that each compartment within the cell nucleus contains no apparent physical barrier like a membrane. Furthermore, the nucleus is a considerably more compact and complex organelle in comparison to other organelles in the cell. Finally, the dynamic nature of the nucleolar proteome adds an additional level of complexity to the task of prediction.

Kernels based on high-scored pairs of k-peptides
Recently, Lei and Dai proposed new SVM kernels for protein sequence encoding based on high-scored pairs of k-peptides [22,23]. Superior performance of the SVMs with these new kernels was demonstrated through application to the prediction of protein subcellular localization. The kernels proposed in [22,23] can be described as follows.
A matrix $D_k$ of high-scored k-peptide pairs is defined with a prescribed threshold. Each entry is associated with the BLOSUM score of some pair of k-peptides. The matrix is of dimension $21^k \times 21^k$, where 21 is the number of amino acid symbols (the 20 standard amino acids plus the special symbol "X"). The thresholds are set to zero for k = 1, 2. Therefore, matrix $D_1$ is the same as the BLOSUM matrix, except that the entries with negative values are replaced by zeroes; the entries of matrix $D_2$ are the BLOSUM pair scores of two di-peptides with all negative values being replaced by zeroes. Since the size of $D_3$ is very large and the majority of all possible pairs are associated with lower scores, the elimination of those pairs can reduce noise that may confuse the prediction. Therefore, careful thresholding is necessary to ensure the sparsity of the matrix $D_3$.
In this work, the threshold is set to 8 for k = 3. For example, the score is 12 for an AAA-AAA pair, 11 for an AAY-ACY pair, and 0 for a TVW-TVR pair since TVW-TVR BLOSUM62 pair-score is 6, which is smaller than the threshold value 8. Given the dimensional scaling, when k > 3, such a coding scheme is less attractive from a computational point of view.
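The thresholded pair scores quoted above can be reproduced with a few lines of code. This is a minimal sketch: only the BLOSUM62 entries needed for the three examples are included, the helper names are hypothetical, and keeping scores that are greater than or equal to the threshold (rather than strictly greater) is an assumption, since the text only states that smaller scores become zero.

```python
# Subset of BLOSUM62 restricted to the residue pairs used in the examples.
# Keys are stored with the two residues in sorted order.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("A", "C"): 0, ("Y", "Y"): 7,
    ("T", "T"): 5, ("V", "V"): 4, ("R", "W"): -3,
}


def residue_score(a: str, b: str) -> int:
    """Symmetric lookup of a substitution score."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def pair_score(p: str, q: str) -> int:
    """Position-wise sum of substitution scores for two k-peptides."""
    return sum(residue_score(a, b) for a, b in zip(p, q))


def d_entry(p: str, q: str, threshold: int) -> int:
    """Entry of the thresholded matrix: the pair score, or 0 if below threshold."""
    score = pair_score(p, q)
    return score if score >= threshold else 0  # ">=" is an assumption


same = d_entry("AAA", "AAA", threshold=8)
close = d_entry("AAY", "ACY", threshold=8)
raw_far = pair_score("TVW", "TVR")
far = d_entry("TVW", "TVR", threshold=8)
```

The same `d_entry` function with `threshold=0` gives the behaviour described for k = 1 and k = 2, where negative scores are replaced by zeroes.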
The new kernels are defined on the mapped vectors as $K(x_{ki}, x_{kj}) = \exp(-\gamma \| D_k x_{ki} - D_k x_{kj} \|^2)$, where $x_{ki}$ and $x_{kj}$ are the k-peptide composition vectors of two sequences. This can be considered as a Gaussian kernel for the pair of vectors $D_k x_{ki}$ and $D_k x_{kj}$. These kernels define the sequence similarity for the mapped vectors $D_k x_{ki}$ and $D_k x_{kj}$, not directly for the k-peptide composition vectors $x_{ki}$ and $x_{kj}$. In this study, the kernel type used for the conventional k-peptide composition encoding methods is the radial basis kernel $\exp(-\gamma \| x_{ki} - x_{kj} \|^2)$.
In the following, the concept described above is illustrated and compared with the conventional k-peptide encoding method. Consider two short amino acid sequences, AAACY and AACCY. Using the input format of SVM-Light [34], the conventional tri-peptide encoding method generates two coding vectors, $x_{31}$ and $x_{32}$, whose entries are written in the format "index:score". The two sequences share only the tri-peptide "AAC", and the corresponding vector index is 2. On the other hand, when the transformed vectors $D_3 x_{31}$ and $D_3 x_{32}$ are calculated using BLOSUM62, they share many more common indices, such as 1, 2, 22-28, etc. Therefore, the similarity between the two sequences is more likely to be captured by the new methods even if the sequences do not explicitly share those tri-peptides.
The mismatch string kernels proposed in Leslie et al. [32] also consider the similarity between mismatched k-peptides. For example, compared with the conventional tri-peptide encoding, the two sequences share several more common tri-peptides, such as AAA and AAC, AAC and ACC, and ACY and CCY, if one mismatch is allowed between two peptides. Our method is therefore related to, but distinct from, the mismatch string kernel.
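The effect of the mapping on the two example sequences can be explored on a reduced problem. The sketch below is an illustration under stated assumptions, not the published implementation: it restricts the alphabet to the three residues A, C and Y (so the matrix is 27 × 27 instead of $21^3 \times 21^3$), keeps scores greater than or equal to the threshold of 8, uses raw tri-peptide counts, and picks an arbitrary value of γ.

```python
import math
from collections import Counter
from itertools import product

# BLOSUM62 entries for a toy three-letter alphabet (keys in sorted order).
SCORES = {("A", "A"): 4, ("A", "C"): 0, ("A", "Y"): -2,
          ("C", "C"): 9, ("C", "Y"): -2, ("Y", "Y"): 7}
ALPHABET = "ACY"
THRESHOLD = 8
PEPTIDES = ["".join(t) for t in product(ALPHABET, repeat=3)]


def d_entry(p, q):
    """Thresholded BLOSUM62 pair score of two tri-peptides."""
    score = sum(SCORES[tuple(sorted(pair))] for pair in zip(p, q))
    return score if score >= THRESHOLD else 0


def composition(seq, k=3):
    """Tri-peptide counts keyed by the peptide string."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mapped(x):
    """Dense vector D_3 x, ordered like PEPTIDES."""
    return [sum(d_entry(p, q) * n for q, n in x.items()) for p in PEPTIDES]


def rbf(u, v, gamma):
    """Gaussian kernel exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


x1, x2 = composition("AAACY"), composition("AACCY")
shared_conventional = [p for p in PEPTIDES if x1[p] and x2[p]]

m1, m2 = mapped(x1), mapped(x2)
shared_mapped = [p for p, a, b in zip(PEPTIDES, m1, m2) if a and b]

k_mapped = rbf(m1, m2, gamma=1e-4)  # gamma chosen arbitrarily for the demo
```

Even on this toy alphabet, the mapped vectors have several non-zero positions in common, whereas the conventional composition vectors overlap only at AAC.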

Multi-class classification system
The efficient extension of SVMs to the handling of multiple classes has been achieved for applications to protein fold prediction [30] and the prediction of subcellular localization [7,16]. The one-versus-one [37] framework was used here for the assembly of the multi-class classifier from binary classifiers. For a classification problem with N classes, every pair-wise binary classifier is trained, which gives a total of N(N-1)/2 classifiers. The label of a testing protein is predicted by jury voting: the votes from all classifiers are summed and the label with the most votes is taken. When ties arise, the label is assigned to the class with the maximum sum of the function margins. This jury voting scheme is very flexible for the assembly of the predictions obtained from various SVM models. It can integrate not only the outcomes from binary predictors with one encoding scheme, but also those obtained from alternative encoding methods. Accordingly, the class label of the testing protein is assigned to the class with the maximum number of votes.
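The voting step can be sketched as follows. The function and variable names are hypothetical, the decision values are invented for the example, and the tie-break shown here (summing the absolute margins of the votes a class has won) is one plausible reading of "the sum of the function margins", not a confirmed detail of the original system.

```python
from collections import defaultdict
from itertools import combinations


def jury_vote(decisions, classes):
    """Combine binary one-versus-one outputs into a single label.

    decisions: iterable of ((class_a, class_b), margin); a positive margin
    is a vote for class_a, otherwise the vote goes to class_b.
    """
    votes = defaultdict(int)
    margin_sum = defaultdict(float)
    for (a, b), margin in decisions:
        winner = a if margin > 0 else b
        votes[winner] += 1
        margin_sum[winner] += abs(margin)
    # Most votes wins; ties are broken by the accumulated margins.
    return max(classes, key=lambda c: (votes[c], margin_sum[c]))


classes = ["Chromatin", "Nucleolus", "Nucleoplasm"]
pairs = list(combinations(classes, 2))  # N(N-1)/2 binary problems

# Invented decision values from two encoding methods for one test protein.
encoding_1 = [(pairs[0], 0.2), (pairs[1], -1.5), (pairs[2], 0.7)]
encoding_2 = [(pairs[0], -0.4), (pairs[1], 0.3), (pairs[2], 0.9)]

label_single = jury_vote(encoding_1, classes)
label_combined = jury_vote(encoding_1 + encoding_2, classes)
n_models_six_classes = len(list(combinations(range(6), 2)))
```

Concatenating the decision lists from several encodings is what lets the same routine serve both a single-encoding classifier and the integrated system.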

Cross-validation and final prediction system
The generalization performance of an SVM is controlled by the following parameters: (1) C: the trade-off between the training error and class separation; (2) γ: the parameter in the radial basis functions $\exp(-\gamma \| x_i - x_j \|^2)$; (3) J: the biased penalty for errors from positive and negative training points.
The leave-one-out cross-validation (LOOCV) was employed for the evaluation. The LOOCV is also referred to as the jackknife test, which is considered to be more rigorous and reliable than other testing techniques. A justification of the rigor and reliability of the LOOCV can be found, e.g., in Chou and Zhang [38]. Assume that there are m proteins overall. Each protein was in turn considered as a testing protein and the parameters associated with the SVM model were optimized based on a 5-fold cross-validation using the remaining m - 1 proteins. The criterion of the optimization is the sum of the Matthews correlation coefficients over all classes [36]. The final LOOCV classifiers were determined by using the optimized parameters to train on the set of m - 1 proteins. The search ranges for the parameters in the 5-fold cross-validation optimization are the following: (1) C: $2^{-2}, 2^{-1}, 1, \dots, 2^{9}, 2^{10}$; (2) γ: $2^{-15}, 2^{-14}, 2^{-13}, \dots, 2^{14}, 2^{15}$; (3) J: 1, 2, 3, ..., 8, 9. The labels of the training sets were arranged so that the size of the negative set is always larger than that of the positive set in our experiment. Here, the penalty term in the SVM is split into two terms, one for the positive and one for the negative training points: $C \sum_{i} \xi_i \rightarrow C \sum_{i: y_i = +1} \xi_i + CJ \sum_{j: y_j = -1} \xi_j$, where $\xi_i$ denotes the training error (slack) of point $i$ and $y_i$ its label. The heavier weight CJ imposed on the errors originating from the negative points enforces a low false positive rate for unbalanced training sets [39]. The final prediction system was constructed as follows.
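The size of the search described above, and the nesting of the 5-fold optimization inside each leave-one-out round, can be made concrete with a short sketch. The helper names are hypothetical, and the fold assignment shown here (a simple interleaved split without shuffling or stratification) is an assumption, since the text does not say how the folds were formed.

```python
from itertools import product

# Search ranges as listed in the text.
C_VALUES = [2.0 ** e for e in range(-2, 11)]       # 2^-2 ... 2^10
GAMMA_VALUES = [2.0 ** e for e in range(-15, 16)]  # 2^-15 ... 2^15
J_VALUES = list(range(1, 10))                      # 1 ... 9
GRID = list(product(C_VALUES, GAMMA_VALUES, J_VALUES))


def leave_one_out(m):
    """Yield (held_out_index, training_indices) for each of the m proteins."""
    for held_out in range(m):
        yield held_out, [i for i in range(m) if i != held_out]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists; interleaved split (assumed)."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


# One outer round for the 504 single-localization proteins.
held_out, outer_train = next(leave_one_out(504))
inner_splits = list(k_fold(outer_train, k=5))
```

Each outer round would evaluate every grid point on the five inner splits and then retrain on all m - 1 proteins with the best setting, which explains why the procedure is computationally demanding.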
The entire set of single-localization proteins was used as a training set, and the optimal value for each parameter of the SVMs for the training set was taken as the average value of the optimal parameters obtained from the LOOCV procedure. Using these optimized parameters, the final binary classifiers were learned from the training set. The evaluation for the set of multi-localization proteins was based on this final prediction system. The framework for the overall training and testing procedures is illus-