Identification of Proteins Secreted by Malaria Parasite into Erythrocyte using SVM and PSSM profiles

Background Malaria parasites secrete various proteins into infected red blood cells (RBCs) for their growth and survival. Identification of these secretory proteins is therefore important for developing vaccines and drugs against malaria. Existing motif-based methods have had limited success because no single motif is shared by all secretory proteins of the malaria parasite. Results In this study, a systematic attempt has been made to develop a general method for predicting secretory proteins of the malaria parasite. All models were trained and tested on a non-redundant dataset of 252 secretory and 252 non-secretory proteins. We developed SVM models and achieved a maximum MCC of 0.72 with 85.65% accuracy and an MCC of 0.74 with 86.45% accuracy using amino acid and dipeptide composition, respectively. SVM models developed using split amino acid and split dipeptide composition achieved a maximum MCC of 0.74 with 86.40% accuracy and an MCC of 0.77 with 88.22% accuracy, respectively. In this study, PSSM profiles obtained from PSI-BLAST have been used for the first time for predicting secretory proteins. We achieved a maximum MCC of 0.86 with 92.66% accuracy using the PSSM-based SVM model. All models developed in this study were evaluated using 5-fold cross-validation. Conclusion This study demonstrates that secretory proteins differ in residue composition from non-secretory proteins. Thus, it is possible to predict secretory proteins from their residue composition using machine learning techniques. A multiple sequence alignment provides more information than the sequence itself, so the method based on PSSM profiles is more accurate than the methods based on sequence composition. A web server, PSEApred, has been developed for predicting secretory proteins of malaria parasites; the URL can be found in the Availability and requirements section.


Background
Human malaria caused by Plasmodium falciparum is one of the major infectious diseases in the world, causing illness in 300 to 600 million people and leading to 2 to 3 million deaths annually [1]. In addition, it places a huge economic burden on affected countries, particularly in Asia and Africa. In order to develop effective drugs and vaccines against this parasite, it is important to identify novel potential drug and vaccine targets. The parasite secretes an array of proteins within the host erythrocyte and beyond to facilitate its own survival within the host cell and for immunomodulation. These secreted proteins can serve as potential drug or vaccine targets. Identification of the secretory proteins of Plasmodium falciparum has had limited success, since experimental identification of these proteins is difficult owing to the complex nature of the parasite.
In silico prediction of secretory proteins is a pressing need in the era of genomics, in which thousands of genomes have been completely sequenced, including that of P. falciparum (size 22.8 MB; 14 chromosomes and 5300 proteins) [2]. It has been shown in the past that secretory proteins of eukaryotes have a signal sequence at the N-terminus, which can be used to predict their secretory nature. One of the commonly used programs for predicting secretory proteins of eukaryotes is TargetP [3]. Although TargetP is successful for eukaryotic proteins, it fails to predict known P. falciparum secretory proteins such as PfEMP1. TargetP fails for P. falciparum because of the parasite's complex life cycle, which alternates between vertebrate and invertebrate hosts. Thus it is not possible to use subcellular localization methods developed either for eukaryotes [4] or prokaryotes [5] for localization of P. falciparum proteins. There is a need to develop organism-specific methods [6]. Recently, two groups independently identified a signal (PEXEL) or motif (VTS) in secretory proteins of P. falciparum that is partly responsible for protein export from the parasite to the erythrocyte [7,8]. However, a number of well-known and experimentally documented secretory or erythrocyte membrane-associated proteins lack these motifs, which emphasizes the existence of multiple pathways that operate in parallel [9]. With the completion of the Plasmodium genome sequence, the challenge is to combine experimental and bioinformatics tools in order to develop algorithms with high predictive value for secretory proteins of the malaria parasite.
In general, there are two important reasons for the failure of these motif-based methods: i) not all secretory proteins have a signal peptide, particularly those secreted by non-classical pathways, and ii) the location of the signal is not conserved in the protein, since it may be found at the N-terminus, at the C-terminus or in the middle of the protein [10]. In order to overcome these limitations, several groups have developed methods based on the amino acid composition or dipeptide composition of proteins [6,11,12]. Recently two web servers (Signal-CF and Signal-3L) have been developed, which provide key steps important for predicting secretory proteins [13,14]. Although composition-based methods have been developed for eukaryotic and prokaryotic proteins, to date no method has been developed specifically for P. falciparum proteins. It has been demonstrated in the past that organism-specific methods perform better than general methods [6]. Thus there is a need to develop a method specifically for predicting secretory proteins of P. falciparum.
In this paper, we describe a method developed for predicting secretory proteins of the malaria parasite. First, the amino acid sequence of a protein is converted into fixed-length patterns by computing various types of composition, such as amino acid and dipeptide composition. Then the machine-learning technique Support Vector Machine (SVM) is used to discriminate between secretory and non-secretory proteins. For the first time in this study, evolutionary information has been used for predicting secretory proteins. The evolutionary information, in the form of a PSSM profile, was obtained from a PSI-BLAST search against the "nr" database. A web server has also been developed for predicting secretory proteins of the malaria parasite.

Analysis of amino acid composition
We analyzed the amino acid composition of both secretory and non-secretory proteins. As shown in Figure 1, the frequencies of alanine, cysteine, isoleucine, lysine, glutamine and threonine are higher in secretory proteins than in non-secretory proteins, while the frequencies of aspartic acid, phenylalanine and glycine are higher in non-secretory proteins than in secretory proteins. The composition of asparagine differs markedly, being very high in non-secretory proteins compared with secretory proteins. This means that secretory proteins can be discriminated from non-secretory proteins based on their amino acid composition. It has been shown in previous studies that secretory proteins have a signal sequence at the N-terminus. Thus it is important to compare the composition of various parts of secretory and non-secretory proteins separately. As shown in Figure 2, the N-terminal composition of the two types of protein is quite different; the magnitude of the bias is much higher than the compositional bias of the whole protein. Similarly, the composition of the C-termini of secretory and non-secretory proteins is quite different (Figure 3). In comparison, the difference in composition of the central region of secretory and non-secretory proteins was low (Figure 4).

Composition based SVM models
It was observed that the amino acid composition of secretory proteins was somewhat different from that of non-secretory proteins. Thus an SVM-based classifier was developed in which the amino acid composition was used as an input vector of dimension 20. Different kernels and parameters of SVM were tried. The performance of our method at different thresholds is shown in Table 1. We obtained an accuracy of around 84% with MCC 0.67 at nearly equal sensitivity and specificity. This model correctly predicts 76% of secretory proteins at 96% specificity using the RBF kernel. It has been observed that localization methods based on dipeptide composition perform better than methods based on amino acid composition [4]. This is because dipeptides provide information about the local order of residues in addition to the amino acid composition. For the present study we developed an SVM-based method in which the dipeptide composition was used as an input vector of dimension 400. As shown in Table 1, we obtained a maximum accuracy of 86.45% with MCC 0.74 using the dipeptide-based SVM model. The SVM model based on dipeptide composition performed better than the SVM model based on amino acid composition.

Split Amino Acid Composition
It has been observed that secretory proteins have signals at either the N or the C terminus. In order to utilize the compositional bias in the termini of secretory and non-secretory proteins, we developed SVM models using split amino acid and split dipeptide composition. As shown in Table 2, we obtained maximum accuracies of 86.20% and 88.22% using split amino acid and split dipeptide composition, respectively. This is slightly better than the accuracy achieved using whole-protein composition. We found the best performance when the protein was split into the 25 N-terminal residues, the 25 C-terminal residues and the remaining protein.

PSSM based SVM models
In the past, multiple sequence alignment information in the form of a position-specific scoring matrix (PSSM) has been used for developing prediction methods [15][16][17]. In this study, PSSM has been used for predicting secretory proteins. First, we created a PSSM profile for each protein using a PSI-BLAST search against the nr database with three iterations, at cut-off 0.01. Secondly, we computed a vector of dimension 400 from the PSSM matrix. Finally, an SVM model was developed using the PSSM-derived vector, and we achieved a maximum accuracy of 92.66% with MCC 0.86. In addition, this model was able to correctly predict 73% of secretory proteins at 100% specificity. This clearly demonstrates that the PSSM provides more information than a single sequence and is useful for predicting secretory proteins.

Figure 1 Amino acid composition chart of secretory and non-secretory proteins. Red and green lines represent the secretory and non-secretory proteins respectively.

Pseudo amino acid composition (PseAAC)
In the past, PseAAC has been widely used for classifying proteins and in subcellular localization methods. Thus we also tried to develop SVM models using simple PseAAC. In this study we computed the pseudo amino acid composition using PseAAC [18,19]. We found that the performance of the PseAAC-based model is better than that of the models based on amino acid or dipeptide composition. However, its performance is poorer than that of our PSSM-based model (Table 3). We tried two properties, hydrophobicity and PI, and the performance obtained with each was nearly the same.

Benchmarking
In order to compare the performance of our method with that of existing methods, we applied the existing methods to the proteins used in this study. Firstly, we applied PlasmoHT, which is motif-based and was developed specifically for predicting secretory proteins in Plasmodium [8]. To use PlasmoHT one needs to provide a PlasmoDB ID; because not all proteins in our dataset are from the PlasmoDB database, it could not be applied to all of them [20]. This method correctly predicted 146 out of 246 secretory proteins (six proteins do not have a PlasmoDB ID). PlasmoHT fails to predict 100 secretory proteins because not all secretory proteins have the conserved signal motif. It also correctly predicted 54 out of 55 non-secretory proteins obtained from PlasmoDB. It was not possible to apply PlasmoHT directly to the 197 non-secretory proteins obtained from Swiss-Prot, as PlasmoHT needs a PlasmoDB ID. Thus, we manually examined the Swiss-Prot entries and found that 53 entries have an ORF name (matching a PlasmoDB ID) in the Gene name field. Six of these 53 Swiss-Prot entries have two or more ORF names (a protein does not necessarily have exactly one ORF name). We examined the 47 proteins for which a single PlasmoDB ID was available and found that two proteins had the PlasmoHT motif. This means that PlasmoHT correctly predicted 45 out of 47 non-secretory proteins. In total, PlasmoHT correctly predicted 99 out of 102 non-secretory proteins. We were not able to locate a PlasmoDB ID for all proteins extracted from Swiss-Prot, which may be due to a number of reasons, e.g. modified forms of proteins, mutated proteins or protein fragments. Secondly, we applied TargetP, a commonly used method for predicting secretory proteins, to our dataset; it correctly predicted 163 out of 251 secretory proteins (it was unable to process one protein because of its large size) [3]. It also correctly predicted 160 out of 252 non-secretory proteins.

Figure 2 Amino acid composition of the 25 N-terminal amino acids of secretory and non-secretory proteins.
In summary, we achieved sensitivity 64.94%, specificity 63.49% and accuracy 64.21% using TargetP on our dataset. Thirdly, we evaluated the performance of two commonly used subcellular localization methods, PA-SUB and WoLF PSORT, on our dataset [21,22]. PA-SUB first extracts the features of sequences similar to the query sequence from Swiss-Prot and then uses a machine learning model to predict the subcellular location of the query protein [21]. WoLF PSORT converts protein amino acid sequences into numerical localization features based on sorting signals, amino acid composition and functional motifs such as DNA-binding motifs. After conversion, a simple k-nearest neighbor classifier is used for prediction [22]. The performance of both methods is shown in Table 4; both methods fail to predict secretory proteins of P. falciparum.

Discussion
During its asexual stage within the host erythrocyte, Plasmodium falciparum remodels the host cell, causing several dramatic changes that affect membrane rigidity, surface antigenicity and permeability. These changes aid pathogenesis and also help the parasite survive within the host cell through nutrient acquisition [23]. It has been estimated that an array of parasite-derived antigens are expressed on the infected cell membrane [24,25]. However, only a few proteins, such as PfEMP-1 and the rifin and stevor family proteins, have been conclusively proven to be on the surface of the infected erythrocyte membrane. The search for parasite-derived proteins within the host cell and on the infected membrane surface remains one of the most warranted areas in malaria research, both for understanding the pathogenesis of the disease and for finding potent vaccine candidate molecules. Recently, two independent groups [7,8] have carried out in silico prediction of proteins exported into the host erythrocyte (a 'secretome') based on the Plasmodium export element (PEXEL) [7] and the vacuolar transport signal (VTS) [8] motifs. These motifs were identified by bioinformatic analysis of aligned N-terminal sequences from proteins known to be exported from the parasitophorous vacuole (PV) into the erythrocyte. Hiller et al. [8] used reiterative alignments to search for the motif, whereas Marti et al. [7] used a search protocol based on the presence of a signal sequence (SS) on exon I. Both reported motifs contain a short stretch of alternating charged and hydrophobic amino acids separated by uncharged amino acids, located a short distance downstream of the SS. The functional role of the PEXEL/VTS motif has been demonstrated by GFP fusion with the SS followed by live fluorescence imaging and by mutational analysis of the PEXEL/VTS motif.

Figure 3 Amino acid composition of the 25 C-terminal amino acids of secretory and non-secretory proteins.
However, PEXEL/VTS-dependent protein trafficking cannot be typified because of overexpression and possibly incorrectly timed expression of the chimeric GFP fusion proteins [9]. Moreover, a RESA-GFP chimera containing PEXEL/VTS was reported to be mistargeted to the lumen of the parasitophorous vacuole [26]. Besides well-known exported proteins, the predicted proteins also include several for which export into the erythrocyte had not previously been shown, including several heat-shock proteins, kinases, phosphatases and putative transporters [8]. One of the major limitations of prediction based on the PEXEL/VTS motif is that it cannot predict proteins that lack the motif but have been experimentally demonstrated to be exported into the erythrocyte, such as P. falciparum skeleton-binding protein (PfSBP), membrane-associated histidine-rich protein (MAHRP) and coat protein (COP)II, all of which seem to be associated with vesicles and/or Maurer's clefts [27]. Moreover, the above-mentioned motif-based methods fail for members of the vir supergene family (homologues vir/bir/gir), proteins predicted to be expressed on the erythrocyte surface [28], since none of these have an SS or PEXEL/VTS motif. Unlike many parasite-encoded proteins exported into the erythrocyte, PfEMP1 lacks an SS. Both groups were able to identify a conserved sequence with biophysical characteristics similar to those of the more classical PEXEL/VTS by creating mini-PfEMP1 reporter constructs consisting of their respective PfEMP1 PEXEL/VTS motif, GFP or YFP, and the conserved C terminus of PfEMP1 (including the TM); however, the reported locations of PEXEL/VTS in PfEMP1 contradict each other. Marti et al. [7] described the motif as located ~16-32 amino acids in from the N terminus, whereas Hiller et al. [8] reported the motif to be 300 amino acids further downstream, within a semiconserved Duffy-binding-like (DBL) domain.

Figure 4 Amino acid composition of the middle part of secretory and non-secretory proteins.
This discrepancy in the location of the PEXEL/VTS motifs casts doubt on the existence of an identical, universal motif for proteins predicted to be exported into the erythrocyte. Although Florens et al. [29] have predicted about 36 hypothetical proteins of the parasite to be located on the infected erythrocyte surface using multidimen-  Nevertheless, the literature strongly advocates the existence of multiple pathways that are cumulatively responsible for the export of parasite proteins into erythrocytes [9].

Conclusion
The bioinformatics approach used in this study is a standard one, commonly used for predicting the subcellular localization of proteins. In addition, evolutionary information in the form of PSSM has been used for the first time for predicting secretory proteins in the malaria parasite. Our model is applicable to a wide range of secretory proteins for which most methods fail. One of the major advantages of the method described in this study is that it is based on the complete sequence rather than on a small region or motif. The server developed for predicting secretory proteins will be very useful for researchers working in the field of malaria.

Non-secretory or Negative dataset
Selection of a negative dataset is always a challenge. We normally prefer to obtain the negative dataset from the manually annotated database Swiss-Prot. In this study, we extracted non-secretory proteins from Swiss-Prot using SRS with the query "Plasmodium Falciparum (Organism) but not Secreted (comment)". This gave 197 non-secretory proteins, whereas we required 252 non-secretory proteins in order to have equal numbers of secretory and non-secretory proteins. Thus we used another database, PlasmoDB, to extract the remaining 55 non-secretory proteins. We extracted nuclear proteins from PlasmoDB and randomly picked 55 proteins from ~300 nuclear proteins. In this way we obtained 252 non-secretory proteins from two sources (197 from Swiss-Prot and 55 from PlasmoDB). We used an equal number of negative examples so that performance could be evaluated with a single parameter such as accuracy or MCC.

Composition
The aim of calculating the composition of proteins is to transform variable-length protein sequences into fixed-length feature vectors. This is important for classifying proteins using machine-learning techniques, because these techniques require fixed-length patterns. The information in a protein can be encapsulated in a vector of 20 dimensions using the amino acid composition of the protein [4,5]. In addition to amino acid composition, dipeptide composition was also calculated, which represents a protein by a vector of 400 dimensions. The advantage of dipeptide composition over amino acid composition is that it encapsulates information about the fraction of amino acids as well as their local order.
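The two encodings described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names are hypothetical, values are expressed as fractions (multiplying by 100 gives percentages), and the handling of non-standard residues (they are skipped) is an assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of residue fractions."""
    # Assumption: residues outside the 20 standard ones (e.g. 'X') are skipped.
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    total = len(residues)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(aa) / total for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of fractions of adjacent residue pairs."""
    sequence = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for first, second in zip(sequence, sequence[1:]):
        pair = first + second
        # Assumption: pairs containing a non-standard residue are not counted.
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[dp] / total for dp in DIPEPTIDES]
```

Both functions return vectors of the same length for any input sequence, which is the property a classifier such as an SVM needs.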

Split Amino Acid Composition (SAAC)
We split each protein into three parts and computed the composition of each part separately. In this way we created a vector of dimension 60 (3 × 20) instead of 20 as in the case of amino acid composition [34]. In SAAC each protein was divided into three parts: (i) 25 amino acids of the N terminus, (ii) 25 amino acids of the C terminus, and (iii) the remaining protein after removing 25 amino acids from the N- and C-termini. The rationale behind using SAAC is that the difference in composition between secretory and non-secretory proteins is more prominent if terminal residues are compared separately instead of as part of the whole protein.
It is known that most secretory proteins have signals at the N-terminus. The advantage of SAAC over standard amino acid composition is that it gives greater weight to proteins that have a signal at either the N or the C terminus.
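The three-part split can be sketched as follows. This is a hedged illustration rather than the study's implementation: the function names are hypothetical, the order of the three blocks in the output vector (N terminus, C terminus, remainder) simply follows the listing above, and the behaviour for proteins shorter than 50 residues (terminal windows overlap and the remainder is empty) is an assumption, since the text does not say how such cases were handled.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(fragment):
    """Fractions of the 20 standard residues in a sequence fragment."""
    residues = [r for r in fragment if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]


def split_amino_acid_composition(sequence, n_term=25, c_term=25):
    """Return a 60-dimensional vector: N-terminal, C-terminal, remainder."""
    if n_term <= 0 or c_term <= 0:
        raise ValueError("terminal window sizes must be positive")
    sequence = sequence.upper()
    n_part = sequence[:n_term]
    c_part = sequence[-c_term:]
    # Empty when the protein is shorter than n_term + c_term residues.
    middle = sequence[n_term:len(sequence) - c_term]
    return composition(n_part) + composition(c_part) + composition(middle)
```

The same splitting applied to dipeptide composition would give a vector of dimension 1200 (3 × 400).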

Multiple sequence alignment in form of PSSM profiles
In the present study, multiple sequence alignment information in the form of PSSM has been used for predicting secretory proteins. Recently a number of studies have used PSSM profile composition for developing prediction methods; for example, MemType-2L, EzyPred, Tbpred, DNAbinder and SRTpred have been used for predicting membrane protein type, enzyme class, subcellular localization of M. tuberculosis proteins, DNA-binding proteins and secretory proteins, respectively [34][35][36][37][38]. The PSSM for each sequence was generated by performing a PSI-BLAST search against the 'nr' database using three iterations with a cut-off e-value of 0.001. For a sequence of length N residues, the PSSM is represented by an N × 20 matrix (the dummy residue 'X' is ignored) [29,30]. Each element of this matrix, m[i, j], provides information on the evolutionary conservation of residue type j at sequence position i. We converted this matrix into a vector of dimension 400 by computing, for each type of amino acid in the protein sequence, the composition of occurrences of each type of amino acid. This means that for each column we have 20 values instead of one [34,37]. Every element in this input vector was subsequently divided by the length of the sequence and then scaled to the range 0-1 using the standard sigmoid function as described by Rashid et al. [34]. The resultant vector of 400 elements was used as the input feature for the SVM.
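One common reading of this conversion is: for each of the 20 residue types, sum the PSSM rows at all positions where the sequence has that residue, giving a 20 × 20 table that is flattened to 400 values. The sketch below follows that reading and is an assumption rather than the authors' code. The function name is hypothetical, the column order is taken to be that of PSI-BLAST's ASCII PSSM output, and the PSSM is assumed to be already parsed into a list of 20-score rows.

```python
import math

# Assumed column order of the ASCII PSSM written by PSI-BLAST.
PSSM_ORDER = "ARNDCQEGHILKMFPSTWYV"


def pssm_to_400(sequence, pssm):
    """Collapse an N x 20 PSSM into a 400-dimensional vector scaled to 0-1.

    sequence: protein sequence of length N.
    pssm: list of N rows, each holding 20 scores in PSSM_ORDER.
    """
    if not sequence or len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must be non-empty and equal in length")
    index = {aa: i for i, aa in enumerate(PSSM_ORDER)}
    sums = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence.upper(), pssm):
        i = index.get(residue)
        if i is None:
            continue  # dummy residue 'X' and other non-standard letters are ignored
        for j, score in enumerate(row):
            sums[i][j] += score
    length = len(sequence)
    # Divide by sequence length, then apply the standard sigmoid 1 / (1 + e^-x).
    return [1.0 / (1.0 + math.exp(-value / length)) for row in sums for value in row]
```

Residue types that never occur in the sequence contribute sums of zero, which the sigmoid maps to 0.5.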

Support Vector Machine
In this study we implemented SVM using the SVM_light package, which allows the user to choose a number of parameters and kernels (e.g. linear, polynomial, radial basis function, sigmoid) or any user-defined kernel. The selection of the kernel is very important in SVM and is analogous to choosing the architecture in an ANN. In this study, learning was carried out using three kernels: linear, polynomial and RBF.
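For reference, the three kernels used here have the following textbook forms. The sketch below is plain Python for illustration only; it does not call SVM_light, and the parameter names and default values are hypothetical rather than that package's options.

```python
import math


def linear_kernel(x, y):
    """K(x, y) = x . y"""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=3, scale=1.0, coef0=1.0):
    """K(x, y) = (scale * x . y + coef0) ** degree"""
    return (scale * linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

Here x and y would be two feature vectors of equal length, for example the 20-dimensional amino acid compositions of two proteins.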

Evaluation
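The performance of any prediction algorithm is often checked by jack-knife tests or cross-validation. In the current study, the performance of all methods and models was evaluated using 5-fold cross-validation, in which the dataset was randomly divided into five equal sets, of which four were used for training and the remaining one for testing. This procedure was repeated five times in such a way that each set was tested once. The final performance was calculated by averaging over all five sets. The performance of our method was computed using the following standard parameters.

(a) Sensitivity: It is the percentage of secretory proteins that are correctly predicted as secretory.

$$\text{Sensitivity} = \frac{TP}{TP + FN} \times 100$$

(b) Specificity: It is the percentage of non-secretory proteins that are correctly predicted as non-secretory.

$$\text{Specificity} = \frac{TN}{TN + FP} \times 100$$

(c) Accuracy: It is the percentage of correctly predicted proteins (secretory and non-secretory proteins).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100$$

(d) Matthews correlation coefficient (MCC):
It is considered to be the most robust parameter of any class prediction method. An MCC equal to 1 is regarded as a perfect prediction, while 0 corresponds to a completely random prediction.

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP and TN are correctly predicted secretory and non-secretory proteins, respectively, FP are non-secretory proteins wrongly predicted as secretory, and FN are secretory proteins wrongly predicted as non-secretory.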
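As a worked check, the sketch below computes sensitivity, specificity, accuracy and MCC from TP, TN, FP and FN using their standard definitions, and applies them to the TargetP counts reported in the Benchmarking section (163 of 251 secretory and 160 of 252 non-secretory proteins correctly predicted). The function name is hypothetical, and returning 0 for MCC when the denominator is zero is an assumption.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity and accuracy (in %) and MCC."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report MCC as 0 when it is undefined (zero denominator).
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# TargetP counts from the text: 163/251 secretory and 160/252 non-secretory correct.
targetp = classification_metrics(tp=163, tn=160, fp=252 - 160, fn=251 - 163)
print(round(targetp["sensitivity"], 2))  # 64.94
print(round(targetp["specificity"], 2))  # 63.49
print(round(targetp["accuracy"], 2))     # 64.21
```

The three percentages reproduce the TargetP figures quoted in the text.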

Authors' contributions
RV developed the computer programs, implemented the SVM and developed the web server. AT and SK collected data, annotated proteins and validated results. GPSR and GV conceived the idea, coordinated the project and refined the drafted manuscript. GPSR guided its conception, helped in the interpretation of data and gave overall supervision to the project. All authors read and approved the final manuscript.