T3SEdb: data warehousing of virulence effectors secreted by the bacterial Type III Secretion System
- Daniel Ming Ming Tay†1,
- Kunde Ramamoorthy Govindarajan†1,
- Asif M Khan1,
- Terenze Yao Rui Ong1,
- Hanif M Samad1,
- Wei Wei Soh1,
- Minyan Tong1,
- Fan Zhang1 and
- Tin Wee Tan1
© Tay et al; licensee BioMed Central Ltd. 2010
Published: 15 October 2010
Effectors of the Type III Secretion System (T3SS) play a pivotal role in establishing and maintaining pathogenicity in the host, and identifying these effectors is therefore important for understanding virulence. However, the effectors display a high level of sequence diversity, which makes identification difficult. There is a need to collate and annotate existing effector sequences in public databases to enable systematic analyses of these sequences, so that models can be developed for screening and selecting putative novel effectors from bacterial genomes that can then be validated by a smaller number of key experiments.
Herein, we present T3SEdb (http://effectors.bic.nus.edu.sg/T3SEdb), a specialized database of annotated T3SS effector (T3SE) sequences containing 1089 records from 46 bacterial species compiled from the literature and public protein databases. Procedures have been defined for i) comprehensive annotation of the experimental status of effectors, ii) submission and curation review of records by users of the database, and iii) regular updating of existing and new T3SEdb records. Fielded keyword searches and sequence searches (BLAST, regular expression) are supported for both experimentally verified and hypothetical T3SEs. More than 171 clusters of T3SEs were detected based on sequence identity comparisons (intra-cluster difference up to ~60%). Given this high level of sequence diversity, T3SEdb provides a large number of experimentally known effector sequences with wide species representation for the creation of effector predictors. We created a reliable effector prediction tool, integrated into the database, to demonstrate the application of the database for such endeavours.
T3SEdb is the first specialised database reported for T3SS effectors, enriched with manual annotations that facilitated systematic construction of a reliable prediction model for identification of novel effectors. The T3SEdb represents a platform for inclusion of additional annotations of metadata for future developments of sophisticated effector prediction models for screening and selection of putative novel effectors from bacterial genomes/proteomes that can be validated by a small number of key experiments.
The Type III Secretion System (T3SS) is an essential mechanism for host-pathogen interaction during the infection process and is found in many Gram-negative bacterial pathogens and symbionts of eukaryotic cells. Examples include Yersinia spp., Salmonella spp., Burkholderia, Pseudomonas and Chlamydia. The T3SS machinery is a highly conserved multi-protein apparatus that mediates the delivery of bacterial effector proteins into the host cell. T3SS effector (T3SE) proteins act as virulence factors within the host and are able to alter and manipulate vital host cell functions, such as signal transduction and the innate immune response.
Due to the key role of T3SE proteins in the establishment and maintenance of bacterial pathogenicity, there is considerable research interest in the identification of T3SS effectors. However, T3SEs display a high level of sequence diversity, due largely to horizontal gene transfer among evolutionarily distant species and subsequent bacterial adaptation to different host cell environments. Although a considerable number of T3SEs have been identified to date by both in vitro and in silico methods, the rising number of effector sequences discovered each year suggests that these represent only a small proportion of all effectors, with many more yet to be discovered.
There is a need to collate and annotate these known effector sequences to enable systematic analyses, so that models can be developed for screening and selecting putative novel effectors from bacterial genomes/proteomes that can then be validated by a small number of key experiments. There is no publicly available specialized database of T3SEs, although databases exist for the T3SS machinery, such as the Database of Type 3 Secretion System (DTTSS). Herein, we present T3SEdb, a specialized database of annotated T3SS effectors, Web-accessible at http://effectors.bic.nus.edu.sg/T3SEdb. Using the annotated repertoire of effectors in the database, we have built a reliable T3SS effector prediction model that may be useful for predicting T3SS effectors expressed in a broad spectrum of bacterial species.
Results and discussion
T3SEdb contains 1089 cross-referenced and manually annotated effector records (as of April 2010), of which 504 are experimentally verified (E), 572 hypothetical (H) and 13 unknown (U), originating from a total of 46 bacterial species. Nine bacterial species (Escherichia coli, Salmonella enterica, Citrobacter rodentium, Pseudomonas syringae, Yersinia pestis, Chlamydia trachomatis, Shigella flexneri, Yersinia enterocolitica and Burkholderia pseudomallei) each had more than 10 effector sequences of both the experimentally verified and the hypothetical kind, with Escherichia coli having the most reported.
T3SEdb records are built on effector sequence records retrieved using various keywords from the NCBI Entrez Protein database. The original records were manually checked to remove irrelevant records, and the retained records were processed to customise the data for the specialised T3SEdb by keeping only fields of interest (the list of fields is provided at http://effectors.bic.nus.edu.sg/T3SEdb/fielddescription.php). Each T3SEdb record is assigned a unique five-character identifier, which consists of the single letter "T" followed by four numeric digits. The experimental status of each record (E, H or U) was assigned following the comprehensive annotation procedure that we defined (http://effectors.bic.nus.edu.sg/T3SEdb/annotationpolicy.php), which involved manually scanning the literature via PubMed, cross-referencing functional annotations in the corresponding records of the effectors in the UniProt/Swiss-Prot database, and performing BLAST searches against the non-redundant (nr) sequence database.
Features of T3SEdb
Diversity of T3SS effectors
Prediction of effectors using machine learning algorithms
Machine learning approaches have been used to create tools for prediction of diverse T3SS effectors based on physico-chemical properties, such as hydrophobicity and polarity, in their N-terminal region [11–13], suggesting that these properties are conserved in this region and encode key functional signals that discriminate effectors from non-effectors. Thus the N-terminal region and its inherent physico-chemical properties, together with complex machine learning approaches, represent attractive avenues for designing and developing T3SS effector prediction models. T3SEdb provides a large number of experimentally known effector sequences with wide species representation for the creation of effector predictors, which may be useful for scanning the genomes of a broad spectrum of bacterial species to discover novel T3SEs. We created an effector prediction system to demonstrate the application of the database for such endeavours.
We focused on the N-terminal 100 amino acid (aa) region of the experimentally known effectors (positive dataset of 100 sequences from 28 species in the database) and of non-effector protein sequences (negative dataset of 100 sequences from 10 species) for the development of the predictor using machine learning methods in the Waikato Environment for Knowledge Analysis (WEKA, version 3.6.2). Three physico-chemical properties, namely hydrophobicity, polarity and β-turns, were studied, and their scores were ascribed to overlapping peptides of window size 9 within the 100aa region for both the positive and negative datasets. The use of the β-turn property has not been reported elsewhere; it was included because proline residues, which suggest the presence of β-turns, are reported to be significantly enriched in effectors of animal pathogens, and we also observed enrichment of proline in the N-terminal region of many of the known effectors in the database. Proline residues represented ~5.46% of the amino acids over the 100aa N-terminal region of experimentally validated effectors, which is higher than the UniProt/Swiss-Prot database statistic of ~4.69% for proline.
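The proline comparison above amounts to a residue count pooled over the first 100 residues of each sequence. A minimal sketch follows; the function name and the example sequence are made up for illustration and are not T3SEdb records.

```python
def residue_percentage(sequences, residue="P", n_term=100):
    """Percentage of `residue` among the first `n_term` residues, pooled over all sequences."""
    total = 0
    hits = 0
    for seq in sequences:
        region = seq[:n_term].upper()
        total += len(region)
        hits += region.count(residue)
    return 100.0 * hits / total if total else 0.0


# Hypothetical 120-aa sequence; only its first 100 residues are counted.
example = ["MPKTPAYIAK" * 12]
pct = residue_percentage(example)
print(f"{pct:.2f}% proline in the N-terminal region")
```

Applied to the experimentally verified sequences exported from the database, the same count could be compared against a background frequency such as the Swiss-Prot value quoted above.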
[Table: Performance measure of binary classifiers in WEKA for prediction of T3SEs. Classifiers listed: Bayesian Logistic Regression; Support vector machines (SVM).]
The model was also tested for its usefulness in scanning bacterial genomes for novel effectors. However, a test against a dataset approximately proportionate to the ratio of effector to non-effector protein sequences in a bacterial genome (positive dataset of 49 sequences from 14 species and negative dataset of 929 sequences from 16 species) returned a low positive predictive value (PPV: the proportion of true positives among the predicted positives) of ~17%. PPV is a more relevant measure for researchers validating predictions in the wet lab, because a model with a high PPV would directly reduce the cost of effector discovery. Although the low PPV of our model is expected given the small proportion of effectors in a bacterial genome, it highlights the challenge facing researchers in the field: to develop more sophisticated prediction tools that use ensemble voting over combinations of additional metadata as prediction features to discriminate effectors from non-effectors in practical genome scans for novel effectors. Examples of such metadata for the development of metapredictors may include a lower rate of evolution and aggregation propensity, which are characteristics of chaperone substrates (T3SEs require chaperones for optimal delivery and/or expression); clinical phenotype; host type (plant versus animal); the quality of the experimental methods used to identify the effectors; and BLAST search scores of known effectors against bacterial genomes, among others. T3SEdb represents an excellent platform for the inclusion of annotations of such metadata for future development of sophisticated effector prediction tools applicable to genome scans.
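The dependence of PPV on class balance described above can be illustrated with a short calculation. In this sketch the dataset sizes (68/68 and 49/929) come from the text, but the sensitivity and specificity of 0.85 are arbitrary values assumed for illustration only; they are not the measured performance of the T3SEdb predictor.

```python
def ppv_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Expected PPV = TP / (TP + FP) for a classifier with the given rates."""
    tp = sensitivity * n_pos            # expected true positives
    fp = (1.0 - specificity) * n_neg    # expected false positives
    return tp / (tp + fp) if (tp + fp) > 0 else float("nan")


# Assumed (hypothetical) classifier rates, identical in both scenarios.
SENS, SPEC = 0.85, 0.85

balanced = ppv_from_rates(SENS, SPEC, n_pos=68, n_neg=68)
skewed = ppv_from_rates(SENS, SPEC, n_pos=49, n_neg=929)
print(f"balanced set PPV: {balanced:.2f}")
print(f"genome-like set PPV: {skewed:.2f}")
```

With identical rates, the PPV on the genome-like set is far lower than on the balanced set, because the false positives are drawn from a much larger pool of non-effectors.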
We have created T3SEdb, the first reported specialised database of T3SS effectors enriched with annotations that facilitated systematic construction of a reliable prediction model for identification of novel effectors. It represents a platform for future developments of sophisticated metapredictors for practical application in the scanning of bacterial genomes for novel effectors.
Bacterial T3SE sequences were collected from the NCBI Entrez Protein database via keyword search. The search was restricted to bacterial sequences, and several synonyms of T3SEs (such as Type Three Secretion System effector, Type 3 Secretion System effector, Type III Secretion System effector, TTSS effector, T3SS effector, Type Three Secretion Effector, Type 3 Secretion Effector, Type III Secretion Effector, Type 3 Secreted Effector, Type III Secreted Effector, Type Three Secreted Effector, T3SE, and TTSE) were included to maximise the number of records retrieved. The records were downloaded in XML format, and the relevant annotation details were extracted and tabulated using in-house Bash scripts. The records were then manually assessed to remove irrelevant entries and annotated according to their experimental validation status: E for experimentally confirmed, H for hypothetical or U for unknown.
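The keyword search described above can be sketched as a single boolean query string. The exact query used by the authors is not given in the text, so the way the synonyms are combined here, and the use of the Entrez `[Organism]` field tag to restrict results to bacteria, are assumptions; the function name is hypothetical.

```python
# Synonyms listed in the text above.
SYNONYMS = [
    "Type Three Secretion System effector",
    "Type 3 Secretion System effector",
    "Type III Secretion System effector",
    "TTSS effector",
    "T3SS effector",
    "Type Three Secretion Effector",
    "Type 3 Secretion Effector",
    "Type III Secretion Effector",
    "Type 3 Secreted Effector",
    "Type III Secreted Effector",
    "Type Three Secreted Effector",
    "T3SE",
    "TTSE",
]


def build_entrez_term(synonyms, organism="Bacteria"):
    """OR together the quoted synonyms and restrict the result to one organism group."""
    phrases = " OR ".join(f'"{s}"' for s in synonyms)
    return f'({phrases}) AND "{organism}"[Organism]'


term = build_entrez_term(SYNONYMS)
print(term)
```

The resulting string could be pasted into the Entrez Protein search box or passed to a programmatic client; retrieved records would still need the manual relevance check described above.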
Modelling T3SS effector predictor using the annotated data of T3SEdb
Since the prediction was focused on the 100aa N-terminal region of the effector proteins, 8 of the 504 experimentally verified sequences that were shorter than 100 amino acids were removed, leaving 496 sequences available for analysis. Duplicates were then removed from the remaining sequences to obtain a unique set of 260 sequences. The unique sequences were then clustered using Blastclust at a 70% identity threshold, which returned a total of 168 clusters. This was done for better data generalization and to minimize data bias, in terms of over-fitting caused by the presence of highly similar sequences and/or by over-representation of one or a few species. One sequence from each cluster (representing the cluster) was used to form the positive dataset for training of the predictor. A total of 100 non-effector protein sequences, trimmed to their N-terminal 100 amino acids, were used as the negative dataset for model training. These negative sequences were randomly selected from 10 bacterial species, namely Citrobacter rodentium, Escherichia coli, Pseudomonas syringae, Pseudomonas tolaasii, Salmonella agona, Salmonella choleraesuis, Salmonella enterica, Salmonella typhi, Salmonella typhimurium, and Yersinia pestis.
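The filtering steps above (length filter, duplicate removal, trimming, one representative per cluster) can be sketched as follows. The clustering itself is not reproduced; cluster membership is assumed to be available as lists of sequences. The text does not say how the representative of each cluster was chosen, so taking the first member is an assumption, and both function names are hypothetical.

```python
def prepare_nterm_set(sequences, n_term=100):
    """Drop sequences shorter than n_term, remove exact duplicates, trim to the N-terminal region."""
    seen = set()
    trimmed = []
    for seq in sequences:
        seq = seq.strip().upper()
        if len(seq) < n_term:
            continue  # too short to supply a full N-terminal region
        if seq in seen:
            continue  # exact duplicate of an earlier full-length sequence
        seen.add(seq)
        trimmed.append(seq[:n_term])
    return trimmed


def pick_representatives(clusters):
    """Return one sequence per non-empty cluster (assumption: the first member)."""
    return [members[0] for members in clusters if members]


# Toy input: one duplicate pair, one sequence that is too short, one of exactly 100 aa.
toy = ["A" * 120, "A" * 120, "C" * 99, "G" * 100]
prepared = prepare_nterm_set(toy)
representatives = pick_representatives([["seq1", "seq2"], ["seq3"], []])
print(len(prepared), representatives)
```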
Thereafter, overlapping nonamers of all the sequences from the positive and negative datasets were scored using ProtScale for three physico-chemical properties: hydrophobicity using the Eisenberg et al. scale, polarity using the Grantham scale and β-turns using the Levitt scale. The scoring of each nonamer's center position was automated, and the original score scale was standardized into a Z-score. The resulting output was 92 features (one per overlapping nonamer in the 100aa region) for each individual physico-chemical property, ascribed to each sequence in both the positive and negative datasets.
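The sliding-window scoring above yields 100 − 9 + 1 = 92 values per property. A minimal sketch follows, under stated assumptions: the scale is standardized across the 20 amino acids before averaging (the text does not say exactly where the Z-score is applied), each window is scored as an unweighted mean, and the scale shown is a toy placeholder rather than the Eisenberg, Grantham or Levitt values.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def zscore_scale(scale):
    """Standardize a per-residue scale to zero mean and unit variance over the 20 amino acids."""
    values = [scale[a] for a in AMINO_ACIDS]
    mu, sigma = mean(values), pstdev(values)
    return {a: (scale[a] - mu) / sigma for a in AMINO_ACIDS}


def window_features(seq, scale, window=9):
    """Unweighted mean score of every overlapping window; raises KeyError on non-standard residues."""
    scores = [scale[a] for a in seq]
    return [mean(scores[i:i + window]) for i in range(len(seq) - window + 1)]


# Toy placeholder scale (NOT a published scale): 1.0 for a few apolar residues, else 0.0.
toy_scale = {a: (1.0 if a in "AILMFVW" else 0.0) for a in AMINO_ACIDS}
z_scale = zscore_scale(toy_scale)

sequence = "MKTAYIAKQR" * 10  # hypothetical 100-aa N-terminal region
features = window_features(sequence, z_scale)
print(len(features))
```

Repeating the call with three different scales would give the three blocks of 92 features per sequence that are described above.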
Following this feature assignment process with ProtScale, the features were imported into WEKA for machine learning analysis. WEKA is an integrated package of machine learning algorithms that provides a variety of binary classifiers (algorithms) that can serve as predictors, allowing efficient comparison of the different algorithms on various performance measures after cross-validation. Prior to classification, feature selection using a greedy stepwise algorithm was applied to select a reduced feature set from the individual physico-chemical properties. The 92 individual features generated for each of hydrophobicity, polarity and β-turns were reduced to a total of 63 combined features after feature selection. The performance of a number of classifiers (with default parameter settings) was measured for their ability to classify effectors and non-effectors using the reduced feature set. We performed 10-fold cross-validation on the training dataset (100 effectors and 100 non-effectors) and used the area under the receiver operating characteristic curve (Aroc) to compare the performance of the available classifiers. The classifiers were then validated on a balanced test dataset of 68 effector and 68 non-effector protein sequences that were not part of the training data, measuring Aroc, sensitivity and specificity. The model was also tested against a dataset approximately proportionate to the ratio of effectors (~5%, perhaps an over-estimate) to non-effector proteins (~95%, perhaps an under-estimate) in a bacterial genome to estimate the positive predictive value (PPV).
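The performance measures named above can be computed directly from classifier scores and true labels. The sketch below is not WEKA's implementation: it uses the rank-based (Mann-Whitney) form of the area under the ROC curve, a fixed decision threshold of 0.5 chosen as an assumption, and made-up labels and scores.

```python
def auc(labels, scores):
    """Area under the ROC curve: probability that a random positive outscores a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP) at the given threshold."""
    tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= threshold)
    fn = sum(1 for l, s in zip(labels, scores) if l == 1 and s < threshold)
    tn = sum(1 for l, s in zip(labels, scores) if l == 0 and s < threshold)
    fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up example: 1 = effector, 0 = non-effector.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

a = auc(labels, scores)
sens, spec = sensitivity_specificity(labels, scores)
print(f"Aroc={a:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

Unlike sensitivity and specificity, the Aroc does not depend on a single threshold, which makes it convenient for ranking classifiers before a threshold is fixed.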
The authors thank Associate Professor Gan Yunn Hwen, Dr Sun Guang Wen, Mr Mark De Silva, Mr Lim Kuan Siong, Mr Xie Chao, Ms Lim Shen Jean, Mr Pawan Kumar Patro and Ms Natascha May Thevasagayam for their help and valuable suggestions as part of the teaching and advisory staff involved in the LSM3241 module on Bioinformatics at the National University of Singapore. We thank EUAsiaGrid (under the project ID RI 223791, co-funded by the European Commission as a Coordinated and Support Action within the 7th Framework Programme) for their support in allowing us to use their computational resources via the Grid portal at http://trg.apbionet.org/euasiagrid/ for a number of our intensive data processing and analysis needs.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 7, 2010: Ninth International Conference on Bioinformatics (InCoB2010): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S7.
- Mota LJ, Sorg I, Cornelis GR: Type III secretion: the bacteria-eukaryotic cell express. FEMS Microbiol Lett 2005, 252(1):1–10. 10.1016/j.femsle.2005.08.036
- He SY, Nomura K, Whittam TS: Type III protein secretion mechanism in mammalian and plant pathogens. Biochim Biophys Acta 2004, 1694(1–3):181–206. 10.1016/j.bbamcr.2004.03.011
- Spreter T, Yip CK, Sanowar S, André I, Kimbrough TG, Vuckovic M, Pfuetzner RA, Deng W, Yu AC, Finlay BB, Baker D, Miller SI, Strynadka NCJ: A conserved structural motif mediates formation of the periplasmic rings in the type III secretion system. Nat Struct Mol Biol 2009, 16(5):468–476. 10.1038/nsmb.1603
- Schroeder GN, Hilbi H: Molecular pathogenesis of Shigella spp.: controlling host cell signaling, invasion, and death by type III secretion. Clin Microbiol Rev 2008, 21(1):134–156. 10.1128/CMR.00032-07
- Engel J, Balachandran P: Role of Pseudomonas aeruginosa type III effectors in disease. Curr Opin Microbiol 2009, 12(1):61–66. 10.1016/j.mib.2008.12.007
- Ma W, Guttman DS: Evolution of prokaryotic and eukaryotic virulence effectors. Curr Opin Plant Biol 2008, 11(4):412–419. 10.1016/j.pbi.2008.05.001
- DTTSS: Database of Type III Secretion System [http://sdbi.sdut.edu.cn/ttss/]
- Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Landsman D, Lipman DJ, Lu Z, Madden TL, Madej T, Maglott DR, Marchler-Bauer A, Miller V, Mizrachi I, Ostell J, Panchenko A, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Shumway M, Sirotkin K, Slotta D, Souvorov A, Starchenko G, Tatusova TA, Wagner L, Wang Y, John Wilbur W, Yaschenko E, Ye J: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2010, (38 Database):D5–16. 10.1093/nar/gkp967
- UniProt Consortium: The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 2010, (38 Database):D142-D148. 10.1093/nar/gkp846
- McGinnis S, Madden TL: BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res 2004, (32 Web Server):W20-W25. 10.1093/nar/gkh435
- Arnold R, Brandmaier S, Kleine F, Tischler P, Heinz E, Behrens S, Niinikoski A, Mewes HW, Horn M, Rattei T: Sequence-based prediction of type III secreted proteins. PLoS Pathog 2009, 5(4):e1000376. 10.1371/journal.ppat.1000376
- Samudrala R, Heffron F, McDermott JE: Accurate prediction of secreted substrates and identification of a conserved putative secretion signal for type III secretion systems. PLoS Pathog 2009, 5(4):e1000375. 10.1371/journal.ppat.1000375
- Yang Y, Zhao J, Morgan RL, Ma W, Jiang T: Computational prediction of type III secreted proteins from gram-negative bacteria. BMC Bioinformatics 2010, 11(Suppl 1):S47. 10.1186/1471-2105-11-S1-S47
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA Data Mining Software: An Update. SIGKDD Explorations 2009, 11(1). 10.1145/1656274.1656278
- Eisenberg D, Schwarz E, Komaromy M, Wall R: Analysis of membrane and surface protein sequences with the hydrophobic moment plot. J Mol Biol 1984, 179(1):125–142. 10.1016/0022-2836(84)90309-7
- Grantham R: Amino acid difference formula to help explain protein evolution. Science 1974, 185(4154):862–864. 10.1126/science.185.4154.862
- Levitt M: Conformational preferences of amino acids in globular proteins. Biochemistry 1978, 17(20):4277–4285. 10.1021/bi00613a026
- UniProtKB/Swiss-Prot protein knowledgebase release 2010_08 statistics [http://au.expasy.org/sprot/relnotes/relstat.html]
- Raineri E, Ribeca P, Serrano L, Maier T: A more precise characterization of chaperonin substrates. Bioinformatics 2010, 26(14):1685–1689. 10.1093/bioinformatics/btq287
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Church DM, DiCuccio M, Edgar R, Federhen S, Helmberg W, Kenton DL, Khovayko O, Lipman DJ, Madden TL, Maglott DR, Ostell J, Pontius JU, Pruitt KD, Schuler GD, Schriml LM, Sequeira E, Sherry ST, Sirotkin K, Starchenko G, Suzek TO, Tatusov R, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2005, (33 Database):D39-D45.
- jQuery library [http://jquery.com/]
- Biegert A, Mayer C, Remmert M, Söding J, Lupas AN: The MPI Bioinformatics Toolkit for protein sequence analysis. Nucleic Acids Res 2006, (34 Web Server):W335-W339. 10.1093/nar/gkl217
- ExPASy Proteomics Server [http://expasy.org/tools/protscale.html]
- Witten IH, Frank E: Data mining: Practical machine learning tools and techniques. San Francisco: Morgan Kaufmann; 2005.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.