Volume 14 Supplement 8
Proceedings of the 2012 International Conference on Intelligent Computing (ICIC 2012)
Protein localization prediction using random walks on graphs
Xiaohua Xu†^{1}, Lin Lu†^{1}, Ping He^{1} and Ling Chen^{1}
DOI: 10.1186/1471-2105-14-S8-S4
© Xu et al.; licensee BioMed Central Ltd. 2013
Published: 9 May 2013
Abstract
Background
Understanding the localization of proteins in cells is vital to characterizing their functions and possible interactions. As a result, identifying the (sub)cellular compartment in which a protein is located becomes an important problem in protein classification. This classification task involves predicting labels in a dataset with only a limited number of labeled data points available. By exploiting a graph representation of protein data, random walk techniques have performed well in sequence classification and functional prediction; however, this method has not yet been applied to protein localization. Accordingly, we propose a novel classifier, based on random walks on a graph, for predicting protein localization sites.
Results
We propose a graph theory model for predicting protein localization using data generated in yeast and gram-negative (Gneg) bacteria. We tested the performance of our classifier on the two datasets, optimizing the model training parameters by varying the laziness values and the number of steps taken during the random walk. Using 10-fold cross-validation, we achieved an accuracy above 61% for the yeast data and about 93% for the gram-negative bacteria data.
Conclusions
This study presents a new classifier derived from the random walk technique and applies this classifier to investigate the cellular localization of proteins. The prediction accuracy and additional validation demonstrate an improvement over previous methods, such as support vector machine (SVM)-based classifiers.
Background
Protein localization is a general term that refers to the study of where proteins are located within the cell. In many cases, proteins cannot perform their designated functions until they are transported to the proper location at the appropriate time. Improper localization of proteins can exert a significant impact on cellular processes or on the entire organism. Therefore, a central issue for biologists is to predict the (sub)cellular localization of proteins [1–3], which has implications for the functions and interactions [4, 5] of proteins.
With the development of new approaches in computer science, coupled with improved datasets of proteins with known localization, computational tools can now provide fast and accurate localization predictions for many organisms as an alternative to laboratory-based methods. Many studies have therefore begun to address this issue. Soon after proposing a probabilistic classification system to identify 336 E. coli proteins and 1484 yeast proteins [6], Paul Horton and Kenta Nakai [7] compared their specifically designed probabilistic model with three other classifiers on the same datasets: the k-nearest-neighbor (kNN) classifier, the binary decision tree classifier, and the naive Bayes classifier. The resulting accuracy using stratified cross-validation showed that the kNN classifier performed better than the other methods, with an accuracy of approximately 60% for 10 yeast classes and 86% for 8 E. coli classes.
Feng [8] presented an overview of the prediction of protein subcellular localization, and in 2004, Donnes and Hoglund [9] reviewed past and current work on this type of prediction as well as guidelines for future studies. Chou and Shen [10] summarized more recent advances in the prediction of protein subcellular localization up to 2007. A variety of artificial intelligence technologies [11–15] have now been developed, including neural networks, the covariant discriminant algorithm, hidden Markov models (HMMs), decision trees and support vector machines (SVMs). Among these methods, SVMs are widely considered a powerful algorithm for supervised learning.
Other methods have also been proposed, such as the YLoc tool implemented by Briesemeister et al. [16] and PROlocalizer [17], an integrated web service that aids prediction. Recently, the random-walk-on-graph technique [18–20] has been applied to biological questions such as the classification of proteins into functional and structural classes based on their amino acid sequences. Weston et al. [21] presented a random-walk kernel based on PSI-BLAST E-values for protein remote homology detection. Min et al. [22] applied the convex combination algorithm to approximate the random-walk kernel with optimal random steps and used this approach to classify protein sequences. Freschi [23] proposed a random walk ranking algorithm to predict protein functions from interaction networks. Random walks are closely linked to Markov chains, which inspired Yuan [24] to apply a first-order Markov chain and extend the residue pair probability to higher-order models to predict protein subcellular locations. Caragea et al. [25] also presented a semi-supervised method for prediction using abstraction-augmented Markov models.
This study introduces a novel random walk method for protein subcellular localization based on amino acid composition. By mapping the protein data onto a weighted, partially labeled graph in which each node represents a protein sequence, we implemented a random walk classification model that predicts the labels of unlabeled nodes, based on our previous theoretical work [26]. We present an intuitive interpretation of the graph representation, label propagation and model formulation. We additionally analyzed the performance of the method in predicting the (sub)cellular localization of proteins. This method produced results that were both competitive and promising when compared to the state-of-the-art SVM classifier.
Results
To demonstrate the classification performance of our method, we compared our classifier with an RBF-SVM implemented with LibSVM [27]; the γ = 1/2σ^{2} of both our RaWa classifier and the RBF-SVM was optimized over the grid {2^{−11}, 2^{−9}, ..., 2^{9}, 2^{11}}. In this study, we adopted an n-fold cross-validation measurement to report the highest prediction accuracy, computed by dividing the number of correctly classified data points by the size of the entire unlabeled dataset.
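As a rough illustration of this protocol (not the authors' code), the γ grid and the accuracy measure can be sketched in Python; `gamma_grid` and `accuracy` are hypothetical helper names:

```python
import numpy as np

def gamma_grid():
    """Candidate values 2^-11, 2^-9, ..., 2^9, 2^11 for gamma = 1/(2*sigma^2)."""
    return [2.0 ** k for k in range(-11, 12, 2)]

def accuracy(predicted, actual):
    """Fraction of unlabeled points classified correctly."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return float(np.mean(predicted == actual))
```

In an n-fold cross-validation loop, `accuracy` would be evaluated on each held-out fold for every grid value, and the best γ kept.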
Predicting the (sub)cellular localization of proteins
Since our classifier involves two parameters, the laziness parameter α for constructing the transition matrix and the number of random walk steps t, we first tested the performance of our classifier on different combinations of α and t. Then, under the optimized parameter settings, we compared our approach to the SVM classifier using various measurements.
Influence of α and t
We found that a large number of steps was unnecessary for the RaWa classifier to achieve the best results. First, the complete graph offers each label a chance to reach an unlabeled node in as little as one step. Second, both figures show that good accuracy was always obtained when the value of t was low. In contrast, the accuracy gradually declines after the peak value of t. This decline is probably due to the fact that, as t increases, P^{ t } becomes trivial and in turn misleads the classification. This situation is quite apparent in Figure 2. In addition, Szummer and Jaakkola [28] found that small constant values of t (about t = 8) were effective on a dataset with several thousand examples.
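The "trivial" regime can be seen on a toy graph (illustrative numbers, not the paper's data): as t grows, the rows of P^{t} approach the walk's stationary distribution, so P^{t} carries less and less information about the starting node.

```python
import numpy as np

# Toy symmetric similarity matrix over three nodes (illustrative values).
W = np.array([[0.0, 1.0, 0.2],
              [1.0, 0.0, 0.5],
              [0.2, 0.5, 0.0]])
P = W / W.sum(axis=1, keepdims=True)     # row-stochastic transition matrix

# Maximum spread between rows, per column, at t = 2 versus t = 50:
spread_t2 = np.ptp(np.linalg.matrix_power(P, 2), axis=0).max()
spread_t50 = np.ptp(np.linalg.matrix_power(P, 50), axis=0).max()
assert spread_t50 < spread_t2            # rows become nearly identical
```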
Since the labeled training data is often deterministic, the transition matrix built over the labeled data is commonly treated as an identity matrix in semi-supervised random walk methods. However, the best result for the yeast data was achieved when α = 0.75, a value that gives the labeled nodes more freedom to move among each other, whereas the best result for the Gneg data was achieved when α = 0.95. Consequently, it is necessary to introduce the laziness parameter when the training data is not fully reliable; α can usually be set above 0.5.
Comparisons with SVM
Sensitivity and precision for the yeast data using 10-fold cross-validation, including the total prediction accuracy
Site | RaWa sensitivity | RaWa precision | SVM sensitivity | SVM precision
MIT | 57.38 | 68.29 | 54.9 | 65.0
NUC | 54.08 | 59.95 | 51.0 | 64.0
CYT | 68.90 | 55.67 | 72.1 | 47.7
ME1 | 84.09 | 55.22 | 72.7 | 68.1
EXC | 51.43 | 64.29 | 57.1 | 58.8
ME2 | 39.22 | 57.14 | 41.2 | 52.5
ME3 | 77.91 | 74.71 | 81.6 | 76.4
VAC | 0 | – | 0 | –
POX | 55.00 | 84.62 | 0 | –
ERL | 1 | 83.33 | 0 | 0
Total accuracy | 61.3±0.11 | | 60.2±0.28 |
Sensitivity and precision for the gram-negative bacteria data using 10-fold cross-validation, including the total prediction accuracy.
Site | RaWa sensitivity | RaWa precision | SVM sensitivity | SVM precision
Cytoplasm | 89.3 | 94.0 | 93.6 | 85.6
Extracellular | 82.4 | 91.0 | 83.8 | 86.1
Inner membrane | 98.2 | 93.7 | 95.9 | 96.5
Outer membrane | 85.6 | 89.2 | 84.5 | 90.1
Periplasm | 79.3 | 91.1 | 84.5 | 85.2
Total accuracy | 93.3±0.24 | | 92.1±0.46 |
Each classifier produced results with high sensitivity and precision, but neither could identify the proteins localized to the VAC site. RaWa performs slightly better because it could predict the proteins localized to POX and ERL, whereas the SVM could not. As illustrated in Table 2, both classifiers produced high sensitivities and precisions for the 5 locations, but according to the total accuracy listed in the last row, our classifier outperformed the SVM by about 1%.
Discussion
Herein, we propose a novel classification model for label propagation through random walks on graphs. We first initialize an undirected complete graph over the labeled data, whose data points act as the nodes and whose pairwise similarities act as the weights. Then, labels and weights are employed to construct the state matrix and state transition matrix so that any node can start a random walk and propagate its label to any unlabeled data point after several steps. This model is also optimized by a kernel method and regularization so as to provide flexible control over the transition matrix.
One interesting possibility for future work is to develop algorithms for a clever selection of the labeled dataset and the kernel based on the data. In this study, we used the very simple Gaussian kernel with the identity covariance matrix, which likely does not exploit the similarity information conveyed in the data points.
Conclusions
Protein cellular and subcellular localization has been an important facet of research because of its role in characterizing protein functions and protein-protein interactions. In this study, we developed a novel approach based on a random walk technique to predict protein localization. We demonstrated that this approach improves the accuracy of predicting protein (sub)cellular localization and is easy to train. When compared to the SVM classifier, our results are both competitive and promising.
Methods
Data preparation
Information about the gram-negative bacteria and yeast data
Proteins | Site | Number
Gram-negative bacteria proteins | Cytoplasm | 140
| Extracellular | 74
| Inner membrane | 687
| Outer membrane | 97
| Periplasm | 116
Yeast | Cytosolic or cytoskeletal (CYT) | 463
| Nuclear (NUC) | 429
| Mitochondrial (MIT) | 244
| Membrane protein, no N-terminal signal (ME1) | 163
| Membrane protein, uncleaved signal (ME2) | 51
| Membrane protein, cleaved signal (ME3) | 44
| Extracellular (EXC) | 37
| Vacuolar (VAC) | 30
| Peroxisomal (POX) | 20
| Endoplasmic reticulum lumen (ERL) | 5
First, we represented a protein sample P with L amino acid residues by its evolutionary and sequence information. Here, to simplify the formulation without losing generality, we use the numerical codes 1, 2, ..., 20 to represent the 20 native amino acid types according to their single-character symbols in alphabetical order. Then, the position-specific scoring matrix (PSSM) was introduced as a descriptor of evolutionary information. The PSSM produces a matrix M_{ L×20 }, where M_{ i→j } represents the score of the amino acid residue in the i-th position of the protein sequence being mutated to amino acid type j through evolution.
where p_{1}, p_{2}, ..., p_{20} are associated with the conventional amino acid composition, reflecting the occurrence frequencies of the 20 native amino acids in the protein P.
We thus represented the protein P by combining the PSSM and PseAA descriptors in the following form: $F_T = \left[\bar{P}_{PSSM}, P_{PseAA}^{\lambda}\right]^{T}$.
To obtain the PseAA values, λ was set to 49 and the weight to 0.05. Because 3 proteins were shorter than 49 amino acids, we obtained 1111 proteins, each with 89 features.
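The conventional amino acid composition part of this representation, the occurrence frequencies p_{1}, ..., p_{20}, can be sketched as follows (an illustrative helper, not the authors' implementation; `composition` is a hypothetical name):

```python
from collections import Counter

# The 20 native amino acids, indexed 1..20 by their one-letter
# symbols in alphabetical order, as described in the text.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Return [p_1, ..., p_20]: the fraction of each amino acid in `sequence`."""
    counts = Counter(sequence)
    n = len(sequence)
    return [counts.get(a, 0) / n for a in AMINO_ACIDS]
```

The PseAA descriptor additionally appends λ sequence-order correlation terms to this 20-dimensional vector, which is how the feature count grows beyond 20.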
Problem formulation
Usually, a training set (X, C) specifies the set of labeled data and the set of their classes, with n the number of tuples in X; the classes of a test set can then be predicted. We first considered an initial graph of the form G(V, E, W), constructed over the training set, where V is the set of nodes and each member v_{ i } corresponds to (x_{ i }, c_{ i }). This graph is assumed to be complete; therefore the edge set E is trivial. We thus provided the labeled nodes with a certain probability of traveling to other nodes (explained below). W is the n×n edge weight matrix and encodes the pairwise similarities, w_{ ij } = sim(v_{ i }, v_{ j }) = sim(x_{ i }, x_{ j }).
We also let Y be a set of m labels that can be applied to the nodes of the graph. After the initial weighted graph was generated, a state transition matrix P = [p_{ ij }]_{ n×n } was defined to give the probability p_{ ij } that node v_{ i } transitions to the state of node v_{ j }. P is generally computed as P = D^{−1}W, where the diagonal matrix D = diag(W 1_{n}) and 1_{n} is an n-dimensional vector with all values set to 1. We next converted each y_{ i } into a vector of labels (i.e., $Y={\left[{y}_{1},{y}_{2},...,{y}_{n}\right]}_{m\times n}$), where y_{ i } = [y_{1i},y_{2i},...,y_{ mi }]^{ T }. Therefore, the label or state of v_{ i } is c_{ j } if and only if y_{ ji } = 1. Y can also be referred to as the state matrix of V or X.
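This construction can be sketched numerically as follows. The Gaussian similarity and toy data are illustrative assumptions (`build_graph` is a hypothetical name; the paper's kernel choice is discussed later):

```python
import numpy as np

def build_graph(X, labels, m, gamma=1.0):
    """Build W (similarities), P = D^-1 W, and the one-hot state matrix Y (m x n)."""
    diff = X[:, None, :] - X[None, :, :]
    W = np.exp(-gamma * (diff ** 2).sum(axis=2))   # w_ij = sim(x_i, x_j)
    P = W / W.sum(axis=1, keepdims=True)           # row-stochastic: P = D^-1 W
    Y = np.zeros((m, len(labels)))
    Y[labels, np.arange(len(labels))] = 1.0        # y_ji = 1 iff label(v_i) = c_j
    return W, P, Y
```

Each row of P sums to 1, and column i of Y is the one-hot state vector y_{i} of node v_{i}.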
Given the state matrix and transition matrix, a simple random walk on V is the process in which the state y_{ i } of any node v_{ i } transitions with probability p_{ ij } to the state y_{ j } of node v_{ j }. Thus, the states of the labeled data are not encoded as absorbing states. Random walks restricted to already-labeled nodes are uninformative, since we utilize the information encoded in a partially labeled graph to predict labels, but the initial graph G is a fully labeled graph. Therefore, we added each unlabeled data point from the test set to graph G as an unlabeled node. This converts the traditional classification problem into a node classification problem on a partially labeled graph.
Random walk classification model
We next aimed to deduce a simple classifier, based on the labeled nodes, that can be applied to predict the labels of the unlabeled nodes. Our solution is a state vector y that provides the label for an unlabeled data point x.
Node classification relies on a random walk that originates at the unlabeled node v_{ j } and ends at a labeled node v_{ i } after several steps; in this way, v_{ j } obtains its label from v_{ i }. If an unlabeled node reaches a labeled node for the first time during the walk, it does not remain at that node, because the labeled nodes are not absorbing states; rather, it moves to another node with a certain probability. Since graphs G and G' are undirected and symmetric, a random walk that starts at v_{ j } and ends at v_{ i } is also reversible.
where W^{+} denotes the pseudo-inverse of W. This is preferred over the inverse of W because W may sometimes be singular. w(V, v) is a column vector that indicates the similarity between the new node v and the nodes in V.
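A small sketch of why the pseudo-inverse is the safer choice: with an illustrative similarity matrix containing two identical rows (so W is singular), `np.linalg.inv` would fail, but the Moore-Penrose pseudo-inverse W^{+} is still defined.

```python
import numpy as np

# Illustrative singular similarity matrix: rows 1 and 2 are identical
# (e.g., two duplicate data points), so det(W) = 0.
W = np.array([[1.0, 1.0, 0.5],
              [1.0, 1.0, 0.5],
              [0.5, 0.5, 1.0]])
W_plus = np.linalg.pinv(W)             # W+ exists even though W is singular
assert np.allclose(W @ W_plus @ W, W)  # defining Moore-Penrose property
```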
Model training
In order to train an effective classifier, the labeled data should be fully utilized; however, the influence of noise within the training data should be avoided, especially because biological measurements always contain a certain amount of noise.
Previous studies have treated the labeled nodes as absorbing states, such that P = I, but here we considered lazy random walks, i.e., P^{ t } = (αI + (1−α)P)P^{ t−1 }, where α∈(0,1) is a laziness parameter indicating that the nodes stay at their current positions with probability α.
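The lazy walk recurrence above amounts to taking powers of the mixed matrix αI + (1−α)P, which can be sketched as (`lazy_power` is a hypothetical helper name):

```python
import numpy as np

def lazy_power(P, alpha, t):
    """t-step lazy transition matrix: (alpha*I + (1-alpha)*P)^t, alpha in (0,1)."""
    lazy = alpha * np.eye(len(P)) + (1.0 - alpha) * P  # stay with prob. alpha
    return np.linalg.matrix_power(lazy, t)
```

Note that the lazy matrix remains row-stochastic for any α, so P^{t} is still a valid transition matrix after t steps.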
Further improvement with the kernel method and regularization
If the dimension of X is d, then the time costs of computing the kernel matrix and the pseudo-inverse matrix to build our classifier are O(dn^{2}) and O(n^{3}), respectively. Computing $\widehat{F}{K}^{+}$ requires O(mn^{2}), where m ≤ n, so the overall cost is O(dn^{2}) + O(n^{3}) + O(mn^{2}) = O(max{d, n}·n^{2}).
Notes
List of Abbreviations
HMM: Hidden Markov Model
kNN: k-Nearest Neighbor
SVM: Support Vector Machine
RBF: Radial Basis Function
PSI: Position-Specific Iterated
BLAST: Basic Local Alignment Search Tool
PseAA: Pseudo Amino Acid
RaWa: Random Walk classifier
Gneg: gram-negative bacteria
ROC: Receiver Operating Characteristic curve
Declarations
Acknowledgements
This work was supported by the National Natural Science Foundation of China under grant No. 61003180, No. 61070047 and No. 61103018; Natural Science Foundation of Education Department of Jiangsu Province under contract 09KJB20013; Natural Science Foundation of Jiangsu Province under contracts BK2010318 and BK2011442; Research Innovation Program for College Graduates of Jiangsu Province (CXLX12_0917); and The New Century Talent Project of Yangzhou University.
Declarations
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 8, 2013: Proceedings of the 2012 International Conference on Intelligent Computing (ICIC 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S8.
References
Bork P, Eisenhaber F: Wanted: subcellular localization of proteins based on sequence. Trends Cell Biol. 1998, 8: 169-170. 10.1016/S0962-8924(98)01226-4.
Emanuelsson O: Predicting protein subcellular localisation from amino acid sequence information. Briefings in Bioinformatics. 2002, 3 (4): 361-376.
Imai K, Nakai K: Prediction of subcellular locations of proteins: Where to proceed? Proteomics. 2010, 10 (22): 3970-3983. 10.1002/pmic.201000274.
Xia J, Zhao X, Song J, Huang D: APIS: accurate prediction of hot spots in protein interfaces by combining protrusion index with solvent accessibility. BMC Bioinformatics. 2010, 11: 174.
Xia J, Zhao X, Song J, Huang D: Predicting protein-protein interactions from protein sequences using Meta predictor. Amino Acids. 2010, 39 (5): 1595-1599. 10.1007/s00726-010-0588-1.
Horton P, Nakai K: A probabilistic classification system for predicting the cellular localization sites of proteins. Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology: 12-15 June 1996; St. Louis, USA. Edited by: States DJ, Johnson D. 1996, 4: 109-115.
Horton P, Nakai K: Better prediction of protein cellular localization sites with the k nearest neighbors classifier. Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology: 21-25 June 1997; Halkidiki, Greece. 1997, 5: 147-152.
Feng ZP: An overview on predicting the subcellular location of a protein. In Silico Biology. 2002, 2 (3): 291-303.
Donnes P, Hoglund A: Predicting protein subcellular localization: past, present, and future. Genomics Proteomics Bioinformatics. 2004, 2 (4): 209-215.
Chou KC, Shen HB: Recent progress in protein subcellular location prediction. Analytical Biochemistry. 2007, 370: 1-16.
Lu Z, Szafron D, Greiner R, Lu P, Wishart DS, Poulin B, Anvik J, Macdonell C, Eisner R: Predicting subcellular localization of proteins using machine-learned classifiers. Bioinformatics. 2004, 20 (4): 547-556. 10.1093/bioinformatics/btg447.
Gardy JL, Brinkman FS: Methods for predicting bacterial protein subcellular localization. Nat Rev Microbiol. 2006, 4 (10): 741-751. 10.1038/nrmicro1494.
Yu CS, Chen YC, Lu CH, Hwang JK: Prediction of protein subcellular localization. Proteins. 2006, 64 (3): 643-651. 10.1002/prot.21018.
Nair R, Rost B: Protein subcellular localization prediction using artificial intelligence technology. Methods Mol Biol. 2008, 484: 435-463. 10.1007/978-1-59745-398-1_27.
Juan E, Chang JH, Li CH, Chen BY: Methods for protein subcellular localization prediction. Proceedings of the Fifth International Conference on Complex, Intelligent, and Software Intensive Systems: 30 June - 2 July 2011; Seoul, Korea. Edited by: Barolli L, Xhafa F, You I, Bessis N. 2011, 553-558.
Briesemeister S, Rahnenfuhrer J, Kohlbacher O: Going from where to why - interpretable prediction of protein subcellular localization. Bioinformatics. 2010, 26 (9): 1232-1238. 10.1093/bioinformatics/btq115.
Laurila K, Vihinen M: PROlocalizer: integrated web service for protein subcellular localization prediction. Amino Acids. 2011, 40 (3): 975-980. 10.1007/s00726-010-0724-y.
Lovász L: Random walks on graphs: a survey. Combinatorics, Paul Erdős is Eighty (Vol. 2), Keszthely (Hungary). 1993, 2: 1-46.
Bhagat S, Cormode G, Muthukrishnan S: Node classification in social networks. arXiv preprint. 2011, arXiv:1101.3291.
Lawler G: Simple random walk. Intersections of Random Walks. Modern Birkhäuser Classics. 2013, 11-46.
Weston J, Leslie C, Ie E, Zhou D, Elisseeff A, Noble WS: Semi-supervised protein classification using cluster kernels. Bioinformatics. 2005, 21: 3241-3247. 10.1093/bioinformatics/bti497.
Min R, Bonner A, Li J, Zhang Z: Learned random-walk kernels and empirical-map kernels for protein sequence classification. J Comput Biol. 2009, 16 (3): 457-474. 10.1089/cmb.2008.0031.
Freschi V: Protein function prediction from interaction networks using a random walk ranking algorithm. Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2007): 2-4 November 2007; California, USA. Edited by: Hu X, Mandoiu I, Obradovic Z, Xia J. 2007, 42-48.
Yuan Z: Prediction of protein subcellular locations using Markov chain models. FEBS Letters. 1999, 451: 23-26. 10.1016/S0014-5793(99)00506-2.
Caragea C, Caragea D, Silvescu A, Honavar V: Semi-supervised prediction of protein subcellular localization using abstraction augmented Markov models. BMC Bioinformatics. 2010, 11 (Suppl 8): S6. 10.1186/1471-2105-11-S8-S6.
Xu X: Random Walk Learning on Graph. PhD thesis. Nanjing University of Aeronautics and Astronautics, Computer Science Department; 2008.
Chang CC, Lin CJ: LIBSVM: a library for support vector machines. [http://www.csie.ntu.edu.tw/~cjlin/libsvm]
Szummer M, Jaakkola T: Partially labeled classification with Markov random walks. Advances in Neural Information Processing Systems. 2002, 14: 945-952.
Chou KC, Shen HB: Large-scale predictions of Gram-negative bacterial protein subcellular locations. J Proteome Res. 2007, 5: 3420-3428.
Shen HB, Chou KC: Gneg-mPLoc: a top-down strategy to enhance the quality of predicting subcellular localization of Gram-negative bacterial proteins. J Theor Biol. 2010, 264 (2): 326-333. 10.1016/j.jtbi.2010.01.018.
Shen HB, Chou KC: Nuc-PLoc: a new web-server for predicting protein subnuclear localization by fusing PseAA and PsePSSM. Protein Engineering, Design & Selection. 2007, 20 (11): 561-567. 10.1093/protein/gzm057.
Azran A: The rendezvous algorithm: multiclass semi-supervised learning with Markov random walks. Proceedings of the Twenty-Fourth International Conference on Machine Learning: 20-24 June 2007; Corvallis, Oregon, USA. Edited by: Ghahramani Z. 2007, 1144-1151.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.