- Research article
- Open Access
Meta-analytic approach to the accurate prediction of secreted virulence effectors in gram-negative bacteria
© Sato et al; licensee BioMed Central Ltd. 2011
- Received: 13 July 2011
- Accepted: 14 November 2011
- Published: 14 November 2011
Many pathogens use a type III secretion system to translocate virulence proteins (called effectors) in order to adapt to the host environment. To date, many prediction tools for effector identification have been developed. However, these tools are insufficiently accurate for producing a list of putative effectors that can be applied directly for labor-intensive experimental verification. This also suggests that important features of effectors have yet to be fully characterized.
In this study, we have constructed an accurate approach to predicting secreted virulence effectors from Gram-negative bacteria. This consists of a support vector machine-based discriminant analysis followed by simple criteria-based filtering. The accuracy was assessed by estimating the average number of true positives in the top-20 ranking in the genome-wide screening. In the validation, 10 sets of 20 training and 20 testing examples were randomly selected from 40 known effectors of Salmonella enterica serovar Typhimurium LT2. On average, the SVM portion of our system predicted 9.7 true positives from 20 testing examples in the top-20 of the prediction. Removal of the N-terminal instability, codon adaptation index and ProtParam indices decreased the score to 7.6, 8.9 and 7.9, respectively. These discrimination features suggest the following characteristics of effectors: an unstable N-terminus, non-optimal codon usage, hydrophilicity, and a lower aliphatic index. The secondary filtering process, represented by coexpression analysis and domain distribution analysis, further raised the average true positive count to 12.3. We further confirmed that our system can correctly predict known effectors of P. syringae DC3000, strongly indicating its feasibility.
We have successfully developed an accurate prediction system for screening effectors on a genome-wide scale. We confirmed the accuracy of our system by external validation using known effectors of Salmonella and obtained an accurate list of putative effectors of the organism. The level of accuracy was sufficient to yield candidates for gene-directed experimental verification. Furthermore, new features of effectors were revealed: non-optimal codon usage and instability of the N-terminal region. From these findings, a new working hypothesis is proposed regarding mechanisms controlling the translocation of virulence effectors and determining the substrate specificity encoded in the secretion system.
- Support Vector Machine
- Codon Usage
- Coexpression Analysis
- Codon Adaptation Index
- Discrimination Feature
Protein secretion and translocation into eukaryotic host cells are key processes in the virulence of pathogenic bacteria. So far, six different secretion systems have been described for Gram-negative bacteria [2, 3]. Among these, the type III secretion system (TTSS) is a representative apparatus that secretes and translocates virulence proteins out of bacterial cells. Representative models of pathogens using TTSS as the main secretion system are the animal pathogens Salmonella, Yersinia, and Shigella and the plant pathogens Pseudomonas and Xanthomonas. Since effector secretion is an important strategy for the virulence of these bacteria, many research groups in the bacterial infection field have made great efforts to identify secretion substrates [4–8]. In these studies, elaborate proteomic and genetic screening methods have been established and many effectors have been identified by genome-wide high-throughput screens such as the translocation assay of CyaA-fused proteins from libraries of transposon-mediated random insertions in the genome. However, the effector repertoires, even for deeply investigated pathogens such as Salmonella, have had to be revised continuously [10, 11]. Moreover, considering the complex and elaborate infection strategy of Salmonella, there may be more effectors yet to be detected. This situation indicates that the utility of the established genome-wide experimental screenings is limited and that new approaches will be necessary to develop a complete catalogue of effectors. Bioinformatics-assisted effector identification is a promising alternative approach. Previous studies have successfully identified novel effectors by using homology-search-based screening or feature-extraction-based approaches such as promoter motif search and analysis of N-terminal amino acid composition bias. Furthermore, recent progress in sequencing technology has enabled whole genomes to be sequenced quickly, at reasonable cost.
In fact, the genomes of many pathogenic bacteria have been sequenced and continue to be sequenced at a growing speed, enabling bioinformatics-based identification of virulence effectors for an expanding number of such bacteria. This supports the development of various prediction tools. However, accurate prediction of TTSS substrates is a very challenging problem because no clear consensus motif has been defined for these substrates. In addition, the secretion mechanism is still largely uncharacterized at the molecular level, as exemplified by the absence of co-purified crystal structures of the effector and its translocator. Homology searching is a straightforward method of sequence-based screening. However, effector genes generally evolve rapidly to adapt to different hosts or to escape from a severe immune response by the host, which makes homology-based approaches difficult. Moreover, the homology search approach alone cannot identify novel effectors. As another bioinformatics approach, machine-learning-based methods have been recently developed. Most of these approaches implement position-specific amino acid composition profiling or a naïve Bayes approach to capture a weak signal and composition bias in the N-termini of effectors. Enrichment of Thr/Ser and depletion of Glu/Asp residues in the N-terminal region is a feature of TTSS substrates commonly observed for a wide range of organisms that utilize TTSS. Other machine-learning techniques, using support vector machine (SVM) [17, 18] or artificial neural network approaches, have also been developed. In these approaches, many more feature parameters are included in addition to sequence motif and composition bias, e.g., GC content, secondary structure prediction, and phylogenetic profiling.
Using these tools, it has been reported that low GC content, atypical phylogenetic relationships showing characteristics of horizontal gene transfer, and enrichment of coiled regions with high solvent accessibility are useful for discriminating effector genes from non-effectors, although findings regarding the N-terminal flexibility of the secretion substrate are controversial [15, 16, 19]. These prediction tools have achieved a certain degree of accuracy and, combined with experimental proteomic analysis, have successfully identified novel effectors. However, none of these tools has achieved sufficient accuracy for genome-scale identification as a sole screening device owing to the high rates of both false positives and false negatives. The ultimate goal of a prediction system is to produce an accurate effector candidate list that could help increase the efficacy of gene-directed experimental verification. To satisfy this demand, true positives must be enriched in the top-20 to -30 ranking of the whole genome prediction. However, as one example, if existing prediction tools such as SVM-based Identification and Evaluation of Virulence Effectors (SIEVE), BPBAac and Effective-T3 were applied to all the genes of LT2, the list of the top 20 in the prediction ranking would include only two to five known effectors. In this situation, experimentalists must expend considerable labour to identify novel effectors by gene-targeted verification based on the candidate list predicted by the existing tools. This also suggests that some characteristics encoded in the TTSS substrates are still undiscovered. In this study, we propose a refined pipeline to predict secreted virulence proteins that is based on a combination of a machine learning approach that extracts discrimination features from amino acid sequences, nucleotide sequences and phylogenetic analyses, and data mining of gene expression databases.
We confirmed that the optimized prediction system outperformed pre-existing prediction tools and that the prediction was accurate enough to conduct efficient gene-directed experimental verification. We also discuss previously unidentified or uncharacterized features of the virulence effectors, which were suggested through the refinement process of the prediction system.
Dataset construction and prediction pipeline
In this analysis, we constructed a new approach for predicting effectors from discrimination features derived from the nucleotide and amino acid sequences and from DNA microarray experimental data. In our prediction system, a meta-analytic approach was adopted, beginning with a machine-learning-based discriminant analysis followed by coexpression analysis and other simple criteria-based filtering. To assess the accuracy of our system, a representative model organism was selected, Salmonella enterica serovar Typhimurium LT2 (hereafter denoted LT2). Another well-studied plant pathogen, Pseudomonas syringae DC3000 (hereafter DC3000), was also selected to test the wider feasibility of our system. As a gold-standard set of positive examples, i.e., known effectors, 40 and 28 effectors from LT2 and DC3000, respectively, were collected from the literature and from our recent experimental results (see Additional file 1: Supp_Table_knownEffector.xls). All other non-effector genes were treated as negative examples and test samples for novel effector screening. We noticed that the translation initiation site of one known effector in LT2 was incorrectly annotated in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. Hence, we re-annotated the open reading frame (ORF) positions in the set of LT2 as described in the Methods section.
Statistics regarding representative discrimination features (classifiers) in the support vector machine portion of the analysis
Table 1. Statistics for features used in the SVM part of discriminant analysis (effector group, n = 40; remaining proteome, n = 4510)
The CAI, introduced by Sharp et al., represents how well the codon usage of a given gene is optimized for effective translation. Selection pressure is known to act on synonymous sites, at which a nucleotide substitution does not change the amino acid, and this pressure produces codon usage bias at those sites. In enteric bacteria, synonymous codon bias increases with gene expression levels. This has been thought to be due to selection in favor of efficiently translated codons. Each amino acid is encoded by one to six codons, and each codon is recognized by a tRNA carrying the matching anticodon. Because tRNA copy numbers vary, codons corresponding to abundant tRNAs are thought to have translational advantages in terms of rate and accuracy. Hence, codon usage tends to be optimized in highly expressed genes such as those encoding ribosomal proteins and chaperones. The CAI value therefore carries biological information related to translation. Recently, it has been suggested that the codon usage of Sec-dependent substrates tends to be non-optimal (i.e., low CAI) [25, 26]. In this study, we estimated the CAI values for known effectors and for the proteome of Salmonella.
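To make the index concrete, the following minimal Python sketch computes a CAI as the geometric mean of per-codon relative-adaptiveness weights, optionally restricted to the first codons of a gene. The weight table and the sequence are invented for illustration only; real weights would be derived from a reference codon usage table, and the function name `cai` is our own.

```python
import math

# Hypothetical relative-adaptiveness weights (w) for a few codons.
# Real values would come from a reference codon usage table.
TOY_WEIGHTS = {
    "CTG": 1.00, "CTA": 0.10,  # Leu
    "AAA": 1.00, "AAG": 0.30,  # Lys
    "GAA": 1.00, "GAG": 0.40,  # Glu
}

def cai(cds, weights, n_codons=None):
    """Geometric mean of codon weights over the first n_codons codons (all if None).

    Codons absent from the weight table are skipped.
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if n_codons is not None:
        codons = codons[:n_codons]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons")
    return math.exp(sum(logs) / len(logs))

# Toy coding sequence: three rare codons followed by nine optimal ones.
seq = "CTAAAGGAG" + "CTGAAAGAA" * 3
n_term = cai(seq, TOY_WEIGHTS, n_codons=3)  # CAI of the first 3 codons
whole = cai(seq, TOY_WEIGHTS)               # CAI of the entire sequence
```

In this toy case the N-terminal value is lower than the whole-sequence value, which is the kind of contrast the N-terminal versus entire-gene comparison in this study is designed to detect.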
The values were lower in the group of effectors, which may be because the codon usage of horizontally-acquired genes is generally not optimized at the time of transfer. As expected, the CAI values in the effector group were lower than those of the proteome in LT2. Although the difference is likely to stem from the same source as in low GC content, i.e., the alien nature of the effectors, the degree of difference in CAI values (Student T-test p-value = 0) is greater than the difference in GC contents (Student T-test p-value = 6.66 × 10^-6). Therefore, the use of CAI in the SVM analysis is expected to refine the overall accuracy of the discriminant analysis. As for the N-terminal instability index, many researchers have reported that the predicted secondary structure elements (coil, alpha helix, beta sheet) showed enriched coil regions in the N-termini of effectors [16–19]. In the present study, we estimated N-terminal instability through POODLE-S, a program that considers the context of a given region in calculating a score. The index also showed a significant difference between members of the effector group and those of the proteome in general. Furthermore, physicochemical parameters estimated from the amino acid sequence by the ProtParam program also showed differences between the two groups. To summarize the ProtParam features, the effectors were likely to be unstable, less aliphatic and hydrophilic. These tendencies were also observed for known effectors of DC3000 (See Additional file 2 Supp_Table_StatDC3000.xls). As for the charge and pI parameters, the values showed opposite relationships between LT2 and DC3000 (Table 1 and Additional file 2 Supp_Table_StatDC3000.xls). The genes in the effector group have relatively negative and low pI values in LT2, whereas effectors in DC3000 have relatively positive and high pI values, compared with those of the proteome in general. This may reflect differences in environmental conditions in which the effectors function.
The rate of effector evolution was estimated to be faster than that of housekeeping genes, as reported in previous studies [28, 29].
Predictive power of the SVM-based discriminant analysis
Table 2. Predictive powers for the various combinations of feature values in the LT2 validation (reported measures include the number of TPs in the top 20)
To examine the impact of the individual feature values, we extracted five sets of feature values and assessed the AUC and RANKavg for each of them. Removal of the POODLE-S index from the feature matrix decreased the average AUC values from 0.993 to 0.989, and the RANKavg value increased from 40.5 to 57.9. The second parameter set showing a notable contribution to discriminative power refinement was the set of physicochemical parameters from ProtParam. In this case, the AUC value also decreased from 0.993 to 0.989, and the RANKavg value increased from 40.5 to 56.7. Although removing the CAI parameters from the discriminant matrix produced only moderate differences (e.g., a decrease in AUC value from 0.993 to 0.991), we confirmed the statistical significance of these differences. Furthermore, the efficacy of the index was also confirmed by two cross-species prediction models: LT2-to-DC3000, and DC3000-to-LT2 prediction models (See Additional file 3 Supp_Table_CrossPred.xls). The importance of these three parameters (POODLE-S, CAI, and ProtParam) was also confirmed by the decrease in average true positive counts in the top-20 from 9.7 to 7.6, 8.9 and 7.9 on removal of POODLE-S, CAI, and ProtParam, respectively (Table 2), which corresponded to one or two losses of true positives.
On the other hand, the dN/dS parameter showed a negligible difference if removed from the matrix, though the dN/dS values were estimated to be significantly higher for the effector group than for the proteome. This may be because the feature represented by dN/dS correlates highly with features indicating an alien origin for genes, such as low GC content and low CAI. The insufficiency of orthologous sequences due to the rapid turnover of effector genes could make the dN/dS parameter ineffective. Hence, the inclusion of sequence data from whole genome shotgun reads increases the effective orthologous sequences of some effectors and may further refine the accuracy of our system. The charge and pI value parameters showed different tendencies between LT2 and DC3000. Inclusion of these parameters decreased the discriminant power in the cross-species prediction (Additional file 3 Supp_Table_CrossPred.xls), as expected from the opposite tendencies of the effectors between the two organisms (Table 1 and Additional file 2 Supp_Table_StatDC3000.xls).
N-terminal flexibility prediction method and its impact on effector discrimination
N-terminal codon usage of effectors is de-optimized
The inclusion of CAI in the SVM-based prediction strategy had a limited effect on the refinement of overall prediction accuracy. One reason is that the feature of low CAI is partially correlated with low GC content, which is a symptom of alien origin. However, we considered that the low CAI feature may stem from other aspects of effectors besides alien nature. To investigate codon bias in the N-terminal regions of the effectors, we compared the N-terminal CAI with the entire CAI. As a result, the CAI of the known effectors for the 25 N-terminal aa sites showed a significantly lower average value (0.53) in the LT2 model than that (0.57) for the entire protein sequence (Student's T-test, p-value 0.002). To investigate the positional difference of the bias, the ratio of the entire CAI to the N-terminal CAI was estimated for all effectors and for all other genes. There were 32/40 (80.0%) cases of known LT2 effectors in which the ratio exceeded 1. To compare values in non-effector genes with alien origins, we selected genes with similar GC content values to those of known effectors. A total of 651 genes with GC contents between 0.38 and 0.48 were selected. Of these, the ratio exceeded a value of 1 in 395 (60.7%) cases. The number of cases in which the ratio exceeded 1 in the effector group was significantly greater than that in the low GC genes (Fisher's exact test, p-value = 0.0094). We also performed window analysis of non-optimal codon usage in the N-terminal region and found that codon de-optimisation was especially prominent in the region between 1 and 32 in the group of known effectors (See Additional file 5 Supp_Doc_CAI.doc). De-optimisation was more prevalent in the known effectors than in putative alien genes. Interestingly, the distribution of non-optimal codon usage in the N-terminal region showed a similar tendency to that of the putative substrate of the Sec translocon. Kampenusa et al. recently reported that the CAI was useful for discriminating among substrates from four different types (I, III, IV, and VI) of secretion systems. The present study revealed that codon bias was especially prominent in the N-terminal region of the secretion substrates. Therefore, codon de-optimisation may stem from a specific translocation mechanism. This characteristic has also been described for the substrate of the Sec-dependent translocon [25, 26]. One possibility is that slow translation of the secretion substrate may be needed for efficient co-translational translocation or for protection against the proteolytic degradation of proteins with disordered N-termini.
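The comparison of ratio counts reported above (32 of 40 effectors versus 395 of 651 low-GC genes) can be set up as a 2×2 contingency table. The Python sketch below computes a one-sided Fisher's exact test from the hypergeometric distribution. The choice of a one-sided test and the treatment of the two groups as disjoint are our assumptions, since the text does not state either, so the result need not match the reported p-value exactly.

```python
from math import comb

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)
    upper = min(row1, col1)
    # math.comb returns 0 when k > n, so impossible tables contribute nothing.
    numer = sum(comb(col1, x) * comb(total - col1, row1 - x)
                for x in range(a, upper + 1))
    return numer / denom

# Rows: known effectors, low-GC non-effector genes.
# Columns: ratio (entire CAI / N-terminal CAI) > 1, ratio <= 1.
p = fisher_greater(32, 40 - 32, 395, 651 - 395)
```

A two-sided variant would additionally sum the probabilities of equally or less likely tables in the opposite tail.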
Increase in enrichment of known effectors in the top ranking by secondary filtering
Secondary filtering by coexpression analysis can be applied only to organisms with at least one expression dataset deposited in a public database. Hence, this filter cannot be applied in the vast majority of de novo assembled genomes. Although we assembled 11 and 4 expression datasets from LT2 and DC3000, respectively, enrichment of true positives in the top-ranking of the first part of the SVM analysis can be improved with as little as one expression dataset with 10 or more sample slides (See Additional file 8 Supp_Doc_ExpRequirement.doc). Since it is expected that the deposition of expression data will increase at a rapid rate owing to the tremendous progress in next-generation sequencing technologies (e.g. RNASeq whole cell transcriptome analysis), heuristic filtering by coexpression analysis will be more feasible in the future screening of virulence effectors.
Putative model effectors predicted by this system and further assessment by recently found effectors
Putative novel effectors of serovar Typhimurium predicted by the system
pathogenicity island encoded protein: SPI3
putative cytoplasmic protein
putative cytoplasmic protein
putative cytoplasmic protein
putative TPR repeat protein
putative serine/threonine protein kinase
putative cytoplasmic protein
putative inner membrane protein
putative cytoplasmic protein
putative outer membrane protein
putative inner membrane protein
putative inner membrane protein
putative coiled-coil protein
putative cytoplasmic protein
putative glucose-6-phosphate dehydrogenase
putative outer membrane protein
chaperone, related to virulence
putative periplasmic protein
Gifsy-1 prophage: similar to transposase
putative periplasmic protein
Surface presentation of antigens; secretory proteins
Homology to invasin C of Yersinia; intimin
putative cell wall-associated hydrolase
Because the exact number of unidentified effectors in the genome is unknown, it is possible that highly ranked genes not currently known to be effectors are in fact true effectors. Hence, the enrichment of known effectors in the top ranking alone does not fully indicate the predictive performance. This is the main difficulty in assessing the performance of effector prediction. However, the high rankings for recently identified effectors, taken together with the enrichment of known effectors in our validation set, suggest that the results of our prediction approach a complete catalogue of effectors; at least, we could make an almost-complete candidate list for identifying effectors that have common characteristics with known TTSS substrates.
In the present study, we confirmed that our system showed significant improvement over existing methods and revealed several novel discriminant features. However, not all of the revealed features, including previously reported ones, were specific enough to precisely determine the substrate, i.e. were clear recognition signals. Construction of prediction tools should support the deciphering of the recognition mechanisms of the secreted proteins through the implication of specific recognition signals or precise recognition principles. It is speculated that such signals, related to translocation mechanisms, are encoded at the three-dimensional level, especially considering the failure to detect a common motif in the primary sequence of TTSS substrates, in spite of recent advances in prediction tools. The use of structural informatics to further refine the prediction system is considered to be a promising approach for the future development of prediction tools.
We developed a meta-analytic approach to predicting virulence effectors accurately by integrating discrimination features derived from the genome sequence information and DNA microarray experimental data. Our analysis consisted of two parts. The first, based on SVM learning, is an approach developed through modification of existing tools. In this SVM-based analysis, new parameters were introduced as follows: (i) N-terminal flexibility estimation by the POODLE-S program, (ii) structure-related parameters estimated by ProtParam, such as grand average of hydropathicity (GRAVY score), and (iii) codon adaptation index. The introduction of these new parameters refined the discriminatory power of the tool. The use of N-terminal flexibility as a determinant of TTSS substrate status has been controversial. In this study, we confirmed that the incorporation of accurate assessment of N-terminal flexibility genuinely refined the prediction power, which supports the hypothesis that N-terminal flexibility is an important feature of the TTSS substrate. The second part of our analytical framework, additional filtering through coexpression analysis using the DNA microarray data in the GEO database and functional domain distribution analysis, further refined the predictive power of our system. In our benchmark test, the number of true positives from 20 known effectors in the top 20 ranking reached 12.3 on average. Hence, our system has good predictive power that is sufficient for candidate selection, which should then be followed by thorough, gene-targeted experimental verification. Furthermore, the putative effectors predicted by our system in LT2 contained many hypothetical genes and genes with virulence annotation, indicating additional novel effectors in the Salmonella genome.
In addition to the successful construction of this system, we also revealed intriguing features of effectors, namely that N-terminal codon usage is significantly de-optimized and that the N-terminal region is predicted to take on a highly flexible structure in these proteins.
Construction of the gold-standard validation set
Known effectors of Salmonella enterica serovar Typhimurium LT2 and P. syringae DC3000 were assembled on the basis of the literature (See Additional file 1 Supp_Table_knownEffector.xls). In total, 40 and 28 known effectors were annotated for LT2 and DC3000, respectively. The annotation data for all 4550 protein-coding genes for LT2 and 5619 protein-coding genes for DC3000 were downloaded from the KEGG GENES database . All genes except for known effectors were treated as negative examples in the validation process. To assess the accuracy of prediction, all positive and negative examples were separated into two sets, the training set and the test set. The ratio of positives to negatives in the training set was set at 1:20, according to the study by Samudrala . The training set was used for the learning of discrimination features and the trained discrimination device was then applied to the rest of the known effectors in the test set. All of the negative examples other than known effectors were included in the test set. The prediction accuracy is highly dependent on the selection of the negative training set. Therefore, 10 sets of training and testing examples were randomly selected (Set1 ~ Set10) and the AUC values were averaged.
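The construction of the 10 validation sets can be sketched as follows in Python. The gene identifiers are placeholders, the helper name `make_split` is ours, and using 76543 as the seed for the split is an assumption (the text gives this value for the SVM software, not necessarily for set selection). Half of the known effectors are held out, and negatives are drawn for training at a 1:20 positive-to-negative ratio.

```python
import random

def make_split(effectors, non_effectors, rng, neg_per_pos=20):
    """Split known effectors in half; draw training negatives at 1:neg_per_pos."""
    shuffled = list(effectors)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    train_pos, test_pos = shuffled[:half], shuffled[half:]
    # Negatives are sampled without replacement from the non-effector genes.
    train_neg = rng.sample(list(non_effectors), neg_per_pos * len(train_pos))
    return train_pos, train_neg, test_pos

rng = random.Random(76543)  # assumed seed, see lead-in
effectors = [f"eff{i:02d}" for i in range(40)]      # placeholder IDs
others = [f"gene{i:04d}" for i in range(4510)]      # placeholder IDs
splits = [make_split(effectors, others, rng) for _ in range(10)]  # Set1..Set10
```

Each split yields 20 training positives, 400 training negatives, and 20 held-out effectors whose ranks are then examined in the genome-wide prediction.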
Accurate annotation of the translation initiation site
Because the signal in the N-terminal region is also one of the important features in our prediction system, the effect of translation initiation site accuracy for effector prediction was first assessed. The translation initiation sites (TISs) were re-annotated by integrating three different TIS annotations: those from the KEGG database, ProTISA, and geneFinder. The TISs of two known effectors, STM1026 and STM2088, were re-annotated. Experimentally validated TISs for these two effectors have not yet been revealed. Regarding STM1026, our previous experiment assessing the translocation of STM1026 suggests that the translocation of this effector requires a re-annotated and elongated, i.e., upstream, ORF (our recent result, Takaya et al., submitted). As for the ORF annotation of DC3000, such incorrect annotation was not observed for known effectors. Therefore, annotation from the KEGG GENES database was used for the SVM analysis for DC3000.
Coexpression analysis using microarray data of GEO datasets
Transcriptome profiling datasets were assembled and all-versus-all comparisons of gene expression similarity were conducted. Eleven and four datasets were selected for the analysis of LT2 and DC3000, respectively (See Additional file 10 Supp_Table_GEODataSet.xls). Firstly, expression data were downloaded from GEO and were normalized to one dataset by Z-score. The similarity of expression profiles between two genes was calculated by Pearson correlation. The expression similarity scores (correlation coefficients) against the selected known effectors from the training set were averaged and defined as a score of coexpression with the known effectors. In serovar Typhimurium, the expression of effector proteins is controlled by two different systems, SPI-1 and SPI-2. Therefore, the effector proteins were divided into two groups according to the regulatory system, and coexpression analysis was conducted separately. The coexpression pattern was clearly observed among known effectors of LT2 (Additional file 6 Supp_Table_CoEXP.xls). Almost all of the known effectors ranked in the top 300 in the SPI-1 or SPI-2 assessment, suggesting that coexpression analysis could successfully discern the tendency of effector genes to be coexpressed.
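As an illustration of the scoring described above, the Python sketch below z-scores each gene within each dataset, concatenates the datasets, and averages the Pearson correlation of a candidate gene against a set of reference effectors. Normalizing per gene within each dataset is one plausible reading of "normalized to one dataset by Z-score", not a confirmed detail; all expression values and gene names are invented, and profiles are assumed to be non-constant.

```python
import math

def zscores(values):
    """Standardize a list of values to mean 0 and unit (population) variance."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

def merge_datasets(datasets):
    """Z-score each gene within each dataset, then concatenate across datasets."""
    merged = {}
    for dataset in datasets:
        for gene, values in dataset.items():
            merged.setdefault(gene, []).extend(zscores(values))
    return merged

def pearson(x, y):
    """Pearson correlation as the mean product of standardized values."""
    zx, zy = zscores(x), zscores(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

def coexpression_score(gene, profiles, reference_effectors):
    """Average correlation of one gene against the reference effectors."""
    refs = [r for r in reference_effectors if r != gene]
    return sum(pearson(profiles[gene], profiles[r]) for r in refs) / len(refs)

# Two invented datasets with three samples each.
ds1 = {"effA": [1, 2, 3], "effB": [2, 4, 6], "cand1": [1, 2, 3], "cand2": [3, 2, 1]}
ds2 = {"effA": [10, 30, 20], "effB": [5, 15, 10], "cand1": [1, 3, 2], "cand2": [3, 1, 2]}
profiles = merge_datasets([ds1, ds2])
score1 = coexpression_score("cand1", profiles, ["effA", "effB"])
score2 = coexpression_score("cand2", profiles, ["effA", "effB"])
```

For the two regulatory groups mentioned above, the same function would simply be called once with the SPI-1 reference effectors and once with the SPI-2 reference effectors.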
Estimation of each parameter used in the SVM analysis
Molecular weight, charge, predicted pI, instability index, aliphatic index, and GRAVY score were calculated using the ProtParam web server. The instability index of the 25 N-terminal aa was estimated using the POODLE-S web server. The "missing residues" parameter was used for the analysis. For codon adaptation index (CAI), the reference codon tables for serovar Typhimurium (Esty.cut) and P. syringae DC3000 (Epsesm.cut) prepared in EMBOSS suite version 2.5.0 were used. The dN/dS ratio was estimated against the multiple alignment of orthologous genes annotated from all available Salmonella genomes in the KEGG database. The codeml module of the PAML package, version 4.2, was used for the analysis. Amino acid composition, position specific matrix, and phylogeny matrices were estimated as described in SIEVE. As for the phylogeny matrices, several points were modified. A total of 1172 organisms from the KEGG ORGANISMS database were used for the analysis. The identity values for assigned orthologous genes were treated as parameters for 1172 dimensions. If no orthologous gene was assigned for a given organism, the value was set to zero. The reciprocal orthology annotation in the KEGG Sequence Similarity Database (SSDB), which is a database of orthologous gene annotation, was used. The organisms used for the analysis are listed in Additional file 11 (Supp_Table_Organisms.xls). The parameters used in the SVM analysis are summarized in Additional file 12 (Supp_Table_FeatParms.xls).
Validation set construction, kernels and parameters used in the SVM analysis
In this study, validation was performed on a genome-wide scale. Therefore, the test set consisted of all of the genes from the respective genomes. The training set consisted of a randomly selected positive set with half of the known effectors. The remaining half was used for validating the prediction power. The negative examples for the training set were selected randomly from the proteome. Three measures were estimated and used to assess accuracy: the area under the curve (AUC), the average true positive count in the top-20 ranking, and the mean rank of the positive examples, each averaged over the 10 validation sets (we call the last score RANKavg). The radial kernel function with a width factor of 1.0 was used according to the benchmark test for the LT2 model (See Additional file 13 Supp_Doc_KernelOptimisation.doc). The seed for the random value was set to 76543. Other parameters were set to default values. SVM analysis and ROC estimation were performed using the Gist package 2.3.
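The three accuracy measures can be illustrated with a short Python sketch that takes a gene-to-score mapping and the set of held-out positives. This is our own minimal implementation for one validation set, not the ROC routine of the SVM package; it assumes higher scores mean more effector-like and that scores are not tied.

```python
def rank_metrics(scores, positives, top_k=20):
    """Return (true positives in top_k, mean rank of positives, AUC).

    scores: dict mapping gene -> discriminant score (higher = more effector-like).
    positives: set of held-out known effectors.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    # 1-based ranks of the positives, in ascending order.
    ranks = [i + 1 for i, gene in enumerate(ranked) if gene in positives]
    n_pos, n_neg = len(ranks), len(ranked) - len(ranks)
    tp_top_k = sum(1 for r in ranks if r <= top_k)
    mean_rank = sum(ranks) / n_pos
    # For the j-th positive (0-based), r - (j + 1) negatives are ranked above it.
    neg_above = sum(r - (j + 1) for j, r in enumerate(ranks))
    auc = 1 - neg_above / (n_pos * n_neg)
    return tp_top_k, mean_rank, auc

# Toy example with five genes, two of which are positives.
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5}
tp, mean_rank, auc = rank_metrics(toy_scores, {"a", "c"}, top_k=2)
```

Averaging `tp` and `mean_rank` over the 10 validation sets would give quantities analogous to the reported top-20 true positive count and RANKavg.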
Determination of thresholds for secondary filtering
Annotation of functional domains commonly observed in bacteria
The functional domain annotation and links to the InterPro database were obtained from the KEGG GENES database. The domain distribution information was extracted from the InterPro database. If the given domain annotation appeared over 5000 times in the bacterial species, the domain was judged to be a commonly conserved domain among bacteria. The threshold of 5000 was selected to cover all of the known effectors.
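The domain criterion can be expressed as a simple count threshold, sketched below in Python. The domain accessions and counts are invented, and the assumption that genes carrying a commonly conserved domain are removed from the candidate list is ours, inferred from the statement that the threshold was chosen to retain all known effectors.

```python
def common_domains(domain_counts, threshold=5000):
    """Domains annotated more than `threshold` times across bacterial genes."""
    return {dom for dom, count in domain_counts.items() if count > threshold}

def passes_domain_filter(gene_domains, common):
    """True if the gene carries none of the commonly conserved domains."""
    return not (set(gene_domains) & common)

# Hypothetical domain accessions and occurrence counts.
toy_counts = {"IPR_A": 12000, "IPR_B": 150, "IPR_C": 5000}
common = common_domains(toy_counts)
keep_b = passes_domain_filter(["IPR_B"], common)
keep_ab = passes_domain_filter(["IPR_A", "IPR_B"], common)
```

A domain seen exactly 5000 times is not treated as common here, following the wording "over 5000 times".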
BLAST homology search for the removal of secretion apparatus genes
First, a set of all secretion apparatus proteins from the KEGG GENES database was collected by keyword search. The keywords used were "apparatus AND secretion AND type" to select genes of all types of secretion apparatuses. In total, 599 genes were selected by the search. To avoid a simple self-hit, all Salmonella and Pseudomonas genes were removed from the list. In the BLASTP search, if the protein had any hit with an e-value less than 10^-5, the protein was judged to be an apparatus protein.
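The e-value criterion can be applied to BLASTP results with a few lines of Python, sketched below. We assume the hits are available in the common 12-column tab-separated tabular format, with the query identifier in column 1 and the e-value in column 11; the output format actually used in the study is not stated, and the identifiers in the example are invented.

```python
def apparatus_like(blast_tabular_lines, evalue_cutoff=1e-5):
    """Return query IDs with at least one hit below the e-value cutoff.

    Assumes 12-column tabular BLAST output: query ID in column 1,
    e-value in column 11. Comment and blank lines are skipped.
    """
    flagged = set()
    for line in blast_tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) < evalue_cutoff:
            flagged.add(fields[0])
    return flagged

# Invented hits: geneA has a strong hit, geneB only a weak one.
toy_hits = [
    "geneA\tref1\t45.0\t200\t100\t5\t1\t200\t3\t205\t2e-30\t120",
    "geneB\tref2\t25.0\t60\t40\t3\t10\t70\t15\t74\t0.5\t28",
    "geneA\tref3\t30.0\t80\t50\t4\t5\t85\t9\t90\t1e-3\t40",
]
flagged = apparatus_like(toy_hits)
```

Proteins returned by this function would be treated as apparatus proteins and excluded from the effector candidate list.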
The SVM portion of the prediction system is implemented as a wrapper program that integrates various bioinformatics tools running on the Unix platform with various web application programming interfaces (APIs). The wrapper program is written in Perl and can be run on a Unix-based machine. The source code is available from our web site http://www.p.chiba-u.ac.jp/lab/bisei/software/index.html. A user-friendly web-server version of the prediction system is currently under construction and will be made available at the same site.
Acknowledgements and funding
We would like to thank C. Samizo and Y. Mitagami for their technical assistance. This work was supported by a Grant-in-Aid for Scientific Research to T. Yamamoto (22390080) from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government. Theoretical calculations were partly performed using the Research Centre for Computational Science, Okazaki, Japan.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.