The cell nucleus is a highly complex organelle that controls cell reproduction and differentiation and regulates cell metabolic activities. It is subdivided into several sub-compartments, called subnuclear locations, where proteins must be located to function properly; mislocation can cause protein malfunction and disease. In-depth information about subnuclear localization may contribute to a full understanding of genomic regulation and function. Compared with general subcellular localization, subnuclear localization is more challenging from a biological viewpoint [1]. From a computational viewpoint, the characteristic differences (e.g. amino acid composition, phylogenetic history, etc.) among proteins within the nucleus are far less distinct than those among proteins from different macro cell compartments, making it hard to achieve satisfactory predictive performance. Shen H et al. (2007) [2] derived the PsePSSM feature representation from the protein sequence by simply averaging the PSSM profile, combined it with the PseAA feature representation, and constructed a kNN ensemble classifier, Nuc-PLoc. Nuc-PLoc divided the nucleus into 9 subnuclear locations and achieved 67.4% overall accuracy. Lei Z et al. (2005) [1] directly used BLOSUM62 to derive the similarity between k-mers from two protein sequences, based on which an ensemble of SVMs with different k-mer lengths was constructed to draw the final conclusion. The model divided the nucleus into 6 subnuclear locations and achieved 50% overall accuracy. To further boost performance, Lei Z et al. (2007) [3] incorporated GO information into the SVM ensemble classifier and achieved 66.5% overall accuracy. However, the unavailability of GO annotation restricts the model's applicability: for novel proteins, or proteins with many missing GO terms, the predictive performance is likely to be rather poor, possibly still around 50%. Evidently, predicting subnuclear localization is more difficult than predicting general subcellular localization.
Machine learning methods for predicting protein subcellular location must address two major factors: deriving protein feature information and designing the predictive model. State-of-the-art feature extraction is data- and model-dependent. The features should not only capture rich biological information but also be discriminative enough to support an accurate classifier. High-throughput sequencing makes protein sequences cheaply available, so many models in computational proteomics are based on the primary sequence alone. On the other hand, data integration has become a popular way to combine diverse biological data, including non-sequence information such as GO annotation, protein-protein interaction, etc.
Many models extract features from the protein sequence. Amino acid composition (AA) is closely related to protein subcellular localization [4] and is the most frequently used feature, often combined with other information [5, 6]. Besides amino acid occurrence, pairwise residue correlation and amino acid physiochemical properties have also been used to encode the protein sequence, e.g. PseAA [7] and ChemAA [8]. The window-based k-mer histogram is another approach to extracting biological information from the protein sequence, e.g. gapAA, dipeptide [6, 8], and the motif kernel [5, 9]; AA is the special case of the k-mer histogram with window size 1. For a k-mer histogram, the feature space expands exponentially with the window size k. To capture size-varying motif information and the context around a specific residue while limiting dimensionality, some approaches compress the 20 amino acids into 7 groups according to their physiochemical properties [10, 11]. The two ends of a protein sequence may carry sorting or anchoring signals relevant to subcellular localization. Hoglund A et al. (2006) [5] combined the N-terminal signal, overall amino acid composition and eMotif information into a unified profile vector representation (PPV), and used this feature vector to construct a hierarchical SVM classifier for protein subcellular localization. Schneider G et al. (2004) [12] reviewed machine learning models that use signal peptides for protein subcellular location prediction, as of 2004.
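As a minimal sketch of the reduced-alphabet k-mer histogram idea, the snippet below maps residues into 7 groups and counts k-mers over the reduced alphabet. The particular grouping shown is illustrative only, not the partition used in [10, 11]:

```python
from collections import Counter
from itertools import product

# Hypothetical 7-group partition of the 20 amino acids by physiochemical
# properties (illustrative; published groupings differ).
GROUPS = {
    "AGV": "1", "ILFP": "2", "YMTS": "3", "HNQW": "4",
    "RK": "5", "DE": "6", "C": "7",
}
AA_TO_GROUP = {aa: g for aas, g in GROUPS.items() for aa in aas}

def kmer_histogram(seq, k):
    """Count k-mers of a sequence after reduction to the 7-letter alphabet."""
    reduced = "".join(AA_TO_GROUP[aa] for aa in seq if aa in AA_TO_GROUP)
    counts = Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))
    # Fixed-order feature vector over all 7^k possible reduced k-mers
    vocab = ["".join(p) for p in product("1234567", repeat=k)]
    return [counts.get(kmer, 0) for kmer in vocab]

# k = 1 recovers (grouped) amino acid composition; larger k captures
# contextual/motif information at the cost of 7^k dimensions instead of 20^k.
vec = kmer_histogram("MKTAYIAKQR", k=2)
```

The dimensionality benefit is the point: with the full alphabet a window of size 3 already yields 20^3 = 8000 features, while the reduced alphabet yields 7^3 = 343.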
Protein phylogenetic information is another source of evidence for protein subcellular localization. Edward M et al. (2000) [13] used Blast to generate a protein's profile distribution over several reference species, and revealed that proteins in the same subcellular location manifest similar phylogenetic profile distributions, while proteins in different subcellular locations are distinctly distributed. Several models extract features from the PSI-Blast profile, such as PSSM and PSFM [14, 15]. Mak M et al. (2008) [15] used PSI-Blast to generate the profile (PSSM & PSFM) for each query sequence, and derived a profile alignment kernel using dynamic programming to define the similarity between two query sequences. Rangwala H et al. (2005) [16] used PSSM & PSFM to derive a string kernel for remote homology detection and fold recognition. The method calculates the profile similarity between all k-length fragments of consecutive amino acids to derive the similarity between two protein sequences, and is therefore rather computationally intensive. Kuang R et al. (2005) [17] designed a profile kernel, a variant of the mismatch kernel [18], which allows a k-length fragment to match its corresponding k-mer if the fragment falls within the positional mutation neighbourhood defined by the k-mer's self-entropy. Kuang R et al. (2009) [19] extended the profile kernel by simple kernel fusion for the prediction of malaria degradomes. Besides profile information, domains are another source of evolutionary information usable for protein subcellular localization. Richard M et al. (2002) [20] analyzed the domain co-occurrence patterns of eukaryotic proteins and found that proteins in the same subcellular location have similar domain co-occurrence patterns. Other studies represent a protein as a flat binary domain vector [21]; in such a sparse high-dimensional representation, the information about domain content and partition boundaries is discarded. Mei S et al.
(2009) [22] proposed a multiple instance learning model that exploits domain boundary information along the protein sequence, where each domain is regarded as an instance and the protein sequence as a bag. Ensemble learning is a commonly adopted method for integrating heterogeneous data, such as GO annotation [23, 24] and PPI networks [19]. In a slightly different approach, Lee K et al. (2008) [8] concatenated the feature vectors from different data sources. The great challenge for these models is how to objectively estimate model performance and how to predict a novel protein when neither GO annotation nor protein-protein interaction information is available. Model estimation has been conducted only in the optimistic scenario where both the training set and the test set have GO or PPI information available, so the published performance may be overestimated. Moreover, when GO or PPI information is unavailable, some base classifiers of the ensemble fail to work and contribute nothing to novel protein prediction. It is therefore worth discussing whether ensemble learning is fit for heterogeneous data integration.
Kernel methods, however, can fuse heterogeneous information (GO/PPI information, etc.) by kernel matrix summation, with zeros filling the matrix entries for missing information. The expensive information can be used to tune the SVM parameters, so that the knowledge contained in the expensively acquired data is transferred to the cheap data, and the expensive information is not required when predicting a novel protein. Kernel methods have seen successful applications in computational biology in recent years [15–19, 25], where the k-mer based kernels [16–19, 25] can be seen as variants of the spectrum kernel and mismatch kernel that incorporate protein sequence profile information. The k-mer feature representation captures the contextual information around an amino acid residue and covers conserved motifs. Alexander Z et al. [9] combined an amino acid composition kernel and a motif kernel using Multiple Kernel Learning (MKL) to automatically optimize the weights of the kernel matrices. The optimal weights were derived using semi-infinite programming instead of convex semi-definite programming, accelerating computation at the cost of global optimality.
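The summation-with-zero-filling idea can be sketched as follows. This is an illustrative toy, not code from the paper: a kernel computed only on annotated proteins is embedded into the full Gram matrix with zero rows/columns for unannotated proteins, then added to the always-available sequence kernel. Since both summands are positive semidefinite, so is the sum:

```python
import numpy as np

def zero_filled_kernel(partial_K, has_info):
    """Embed a kernel computed only on annotated proteins into the full
    n x n matrix, with zeros for proteins lacking the annotation."""
    n = len(has_info)
    K = np.zeros((n, n))
    idx = np.flatnonzero(has_info)          # positions of annotated proteins
    K[np.ix_(idx, idx)] = partial_K          # scatter the partial Gram matrix
    return K

# Toy example: sequence kernel for 4 proteins (always available)
K_seq = np.eye(4)
# Hypothetical GO kernel, available only for proteins 0 and 2
K_go_partial = np.array([[1.0, 0.5],
                         [0.5, 1.0]])
# Fuse by summation; proteins 1 and 3 fall back on sequence information alone
K_fused = K_seq + zero_filled_kernel(K_go_partial, [True, False, True, False])
```

A novel protein without GO annotation simply contributes zero GO entries, so prediction degrades gracefully to the sequence kernel rather than failing outright.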
In this paper, we use only the amino acid information of the protein sequence, without any other information, to design a widely applicable model for protein subnuclear localization. We use the k-spectrum kernel to exploit the contextual information around an amino acid and the conserved motif information. Besides expanding the window size, we adopt various amino acid classification approaches to capture diverse aspects of amino acid physiochemical properties. Each amino acid classification generates a series of spectrum kernels based on different window sizes. Thus, (I) window expansion captures more contextual information and covers size-varying motifs; (II) various amino acid classifications exploit multi-aspect biological information from the protein sequence; (III) amino acid classification compresses the 20 amino acids into a smaller alphabet, allowing a larger window size and reducing the dimensionality of the feature space. Finally, we combine all the spectrum kernels by simple addition into one single kernel, called SpectrumKernel+, for protein subnuclear localization.
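The combination by addition described above can be sketched as follows. This is a minimal illustration of the idea, assuming the standard inner-product form of the spectrum kernel; the paper's actual window sizes and amino acid classifications are left as parameters:

```python
import numpy as np
from collections import Counter

def spectrum_features(seq, k, aa_map):
    """k-mer counts of a sequence after mapping residues via aa_map."""
    s = "".join(aa_map.get(a, "") for a in seq)
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(seqs, k, aa_map):
    """Gram matrix of the k-spectrum kernel: inner product of k-mer counts."""
    feats = [spectrum_features(s, k, aa_map) for s in seqs]
    n = len(seqs)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = sum(feats[i][m] * feats[j][m] for m in feats[i])
    return K

def spectrum_kernel_plus(seqs, ks, aa_maps):
    """Sum spectrum kernels over all window sizes and amino acid
    classifications (combination by simple addition, as in SpectrumKernel+)."""
    n = len(seqs)
    K = np.zeros((n, n))
    for aa_map in aa_maps:
        for k in ks:
            K += spectrum_kernel(seqs, k, aa_map)
    return K

# Usage: identity mapping (no compression) summed over k = 1, 2.
# Further aa_maps would encode the amino acid classifications.
identity = {a: a for a in "ACDEFGHIKLMNPQRSTVWY"}
K = spectrum_kernel_plus(["MKT", "MKM"], ks=[1, 2], aa_maps=[identity])
```

The summed matrix remains a valid kernel (a sum of positive semidefinite matrices is positive semidefinite), so it can be fed directly to an SVM with a precomputed kernel.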