- Methodology article
- Open Access
Genome-wide prediction of prokaryotic two-component system networks using a sequence-based meta-predictor
© Kara et al. 2015
Received: 29 March 2015
Accepted: 16 September 2015
Published: 18 September 2015
Two component systems (TCS) are signalling complexes manifested by a histidine kinase (receptor) and a response regulator (effector). They are the most abundant signalling pathways in prokaryotes and control a wide range of biological processes. The pairing of these two components is highly specific, often requiring costly and time-consuming experimental characterisation. Therefore, there is considerable interest in developing accurate prediction tools to lessen the burden of experimental work and cope with the ever-increasing amount of genomic information.
We present a novel meta-predictor, MetaPred2CS, which is based on a support vector machine. MetaPred2CS integrates six sequence-based prediction methods: in-silico two-hybrid, mirror-tree, gene fusion, phylogenetic profiling, gene neighbourhood, and gene operon. To benchmark MetaPred2CS, we also compiled a novel high-quality training dataset of experimentally deduced TCS protein pairs for k-fold cross validation, to act as a gold standard for TCS partnership predictions. Combining individual predictions using MetaPred2CS improved performance when compared to the individual methods and in comparison with a current state-of-the-art meta-predictor.
We have developed MetaPred2CS, a support vector machine-based metapredictor for prokaryotic TCS protein pairings. Central to the success of MetaPred2CS is a strategy of integrating individual predictors that improves the overall prediction accuracy, with the in-silico two-hybrid method contributing most to performance. MetaPred2CS outperformed other available systems in our benchmark tests, and is available online at http://metapred2cs.ibers.aber.ac.uk, along with our gold standard dataset of TCS interaction pairs.
A wide range of critical functions in prokaryotes such as antibiotic resistance, stationary phase transition, competence, sporulation, chemotaxis, nitrogen regulation, virulence, and phosphate regulation are mediated by a particular type of signalling pathway known as a two-component system (TCS) . TCS typically operate through the transfer of phosphoryl groups from a His residue of a histidine kinase (HK) to an Asp residue of a response regulator (RR), in response to an extracellular stimulus. A variant of the TCS, known as a phosphorelay, includes extra receiver and phosphotransfer domains relaying the phosphoryl group between the HK and RR proteins .
Genome-wide identification of HK and RR proteins is relatively straightforward , with a variety of TCS databases and prediction servers available [4–6]. However the identification of HK-RR pairs is challenging as TCS pairs are highly specific , there are multiple HK and RRs in most genomes, and their genes are often unpaired (orphan HKs and RRs). Several experimental approaches have been used to identify HK-RR pairs, including phosphotransfer profiling [8–10] and yeast two-hybrid assays [11–14]. Such approaches are costly and labour intensive, therefore it is important to develop computational tools to lessen the burden and complement experimental approaches.
The use of meta-predictors, predictors that combine predictions from individual methods using machine learning algorithms, is a common approach in bioinformatics [15–21]. The advantage of meta-predictors is that they do not rely on single methods, but can integrate a wide range of information under a common probabilistic umbrella without relying on complex scoring functions . In doing so, the strengths and weaknesses of individual predictions are combined to achieve higher levels of accuracy . Examples of meta-predictors include those developed for the prediction of functional sites in proteins  or prediction of critical residues in protein interfaces .
In this work we present MetaPred2CS, a sequence-based meta-predictor designed specifically to predict protein pairs in TCS. MetaPred2CS is based on a Support Vector Machine (SVM) [25, 26] and combines six independent and orthologous protein-protein interaction prediction methods: in-silico two-hybrid (i2h) , mirror tree (MT) , phylogenetic profiling (PP) , gene fusion (GF) , gene neighbourhood (GN) and gene operon (GO) . The i2h and MT methods are based on co-evolution theory, and rely on high quality and complete multiple sequence alignments (MSAs), while PP, GF, GN and GO, are genome context methods, utilising different genomic information such as chromosomal proximity (GN), operons (GO), fusion events (GF) and inter-genomic profiles (PP) between fully sequenced genomes.
The identification performance of MetaPred2CS was tested using validated experimental data and it achieved a higher accuracy compared to individual prediction methods such as i2h, MT, GF, PP, GN and GO. MetaPred2CS also compared favourably against a Bayesian meta-predictor benchmarked on TCS pairs  and a database of protein-protein interactions: STRING .
Datasets: training and testing
A variety of datasets, described in detail below, were used during the development of the predictor to benchmark its performance under different scenarios and to compare to an independent, competing, method and pre computed scores in the STRING  database. File1.xls and File2.pdf in the Additional files 1 and 2 provide complete details and a diagrammatic representation of the all sets described below.
The P+ and P- datasets
The P+ and P- sets contain 113 interacting and 1134 non-interacting experimentally validated TCS pairs respectively, and were compiled and manually curated from the current literature. These sets were used to train and test the MetaPred2CS using a k-fold cross validation strategy. Specifically, the P+ set was compiled by mining protein-protein interaction databases, including BioGriD , DIP , IntAct , PSI-MI , UniProtKB  and MINT  using the RefSeq identifiers extracted from the P2CS database . To create P-, experimentally validated non-interacting pairs were mined from publications describing high-throughput, systematic, yeast two-hybrid or phosphotransfer profiling experiments, from a number of organisms including: Caulobacter crescentus, Escherichia coli, Mycobacterium tuberculosis, Myxococcus xanthus, Synechocystis sp. and Mesorhizobium loti [8–14].
To test the performance of MetaPred2CS under different scenarios and to compare it to an independent, competing, methods, we derived interdependent testing sets as described below (NP+, OP+, species-specific, T, SP+ and SP- sets). In each test, MetaPred2CS was trained with the corresponding, orthogonal, version of P+ and P- (i.e. removing any proteins present in the testing subset).
NP+ and OP+ and Species-specific datasets
The NP+ set (for Neighbouring Pairs) contains 56 pairs of TCS that are encoded by neighbouring genes, while the OP+ set (for Orphan Pairs) is composed of 57 pairs that are encoded by genes, which are not adjacent in the genome. This distinction is important, as predictions of orphan pairs are usually more challenging [40–42]. In order to further clarify species-specific and positive-to-negative class ratio bias in the predictions, we also produced four different species-specific testing sets: Escherichia coli, Myxococcus xanthus, Synechocytis sp, and Mesorhizobim lotis.
T, SP+ and SP- datasets
Datasets T, SP+ and SP- were extracted from the work by Burger and van Nimwegen  as testing sets to compare MetaPred2CS performance. For all these three testing sets, MetaPred2CS was trained with an orthogonal version of P+ and P- sets, i.e. any pair present in either of the testing sets was removed from P+ or P- prior to training. The T dataset is composed of 16 experimentally validated interacting pairs and 5 non-interacting pairs while the SP+ and SP- sets are composed of pairs of TCS extracted from the SwissRegulon database . In addition, The SP+ and SP- was also used to compare to STRING  database.
The MetaPred2CS prediction method
Individual prediction methods
The selection of individual methodologies was based on their orthogonalitynature, i.e. sequence-based, performance and availability. MetaPred2CS integrates the prediction of six different methods: i2h, MT, PP, GF, GN and GO. Briefly, the i2h method scans for correlated (compensatory) mutations between residues of the two proteins of interest, where a high-correlation implies high-probability that given pair interacts . The MT method relies on similarity between phylogenetic trees to infer the likelihood of interactions between pairs of proteins . The GN and GO methods are based on the observation that proteins that are functionally related tend to be transcribed and expressed concurrently, i.e. are encoded by adjacent genes, particularly in prokaryotes . The PP method is based on the idea that functionally related genes under strong selective pressure appear or disappear together as units during speciation events . Finally, the GF method is based on fusion events, i.e. if two proteins appear as independent units in one organism but as a joint entity in another organism, then it is likely that the individual units are actually an interacting pair . Detailed information about these methods as well as their technical aspects can be found in the individual references indicated above.
Reference genome dataset
The GN, GO, GF, and PP methods rely on a reference genome dataset, the quality of which, in the form of size and diversity, is central to their performance . To that end, and to maximize the prediction performance of these methodologies, we compiled a diverse, yet relatively small (as a compromise between performance and calculation speed), reference genome dataset based on the most successful genome combinations proposed by Muley and Ranjan . Our dataset comprised 243 individual genomes, belonging to 22 different classes (for further details see Table S1 in the Additional file 3). Genome annotations were downloaded from the NCBI database  and operon data from annotated genomes using Moreno-Hagelsieb and Collado-Vides' approach  available at http://microbiome.wlu.ca/public/TUpredictions/Predictions/ .
Implementation of individual prediction methods
All methods were implemented as described in their original publications. Co-evolutionary-based methods (i2h and MT) rely on the quality of MSAs, i.e. their completeness and diversity. We generated MSAs following the approach described in our previous work , using UniprotKB . ParseBlast generates complete and diverse MSAs by filtering both highly identical and highly dissimilar sequence homologs, considering also the coverage of the alignment between query and hit proteins . With the exception of the minimum and maximum number of species represented in the MSAs, 25 and 50 respectively, the rest of the prediction parameters were set to default values as described in the original works [27, 28]. These two parameters set the number of sequences shared between both MSAs, which include the sequences of common species in both alignments selecting the ones with the highest sequence identity to the corresponding pair. Thus, the minimum and maximum number in common between the two MSAs is an important aspect on these methodologies as its performance is highly influenced by these two parameters, i.e. the diversity of alignment.
We used InPrePPI , which implements all the genome context based methods (GF, PP, GN and GO), which requires a reference genome dataset and genome annotation (e.g. operon units) as described above. Among the parameters required for prediction are: (i) the evolutionary distances between target and reference organisms, which was calculated using 16S RNA data as described previously ; (ii) an e-value cut-off of 1e-5 for BLASTP  searches; (iii) a cut-off of 0.35 for mutual information values (required for PP); and (iv) a distance cut-off of 200 bp for the GN/GO predictions, as suggested previously [45, 49, 50]. The GF method identified fusions events, which in the case of TCS result in hybrid proteins combining a HK and RR in single coding unit, by using local alignments based on the Smith-Waterman algorithm , implemented in the ssearch36 program using default parameters .
Integration of individual prediction methods using a SVM: MetaPred2CS
Benchmarking and comparison of MetaPred2CS performance
AUC values of predictions by individual methods for the P+/P-, NP+/P- and OP+/P- datasets. GN and GO methods were not included the AUC comparison given the large genomic distance between pairs on the P- dataset that made the predictions unfeasible
AUC Values of Individual Methods
Assessing MetaPred2CS performance
Where TP, FP, TN, and FN represent true positives, false positives, true negatives and false negatives, respectively. Particularly important are the MCC values, given the disequilibrium of positive and negative classes, i.e. the difference in size of interacting and non-interacting pairs. The statistical analysis of ROC curves was performed using STAR .
Results and Discussion
Evaluation of individual prediction methods
Individual methods were tested on the P+/P-, NP+/P-, and OP+/P- datasets. Prediction performance metrics of each method are presented and compared using AUC values (Table 1). Co-evolutionary methods (i2h and MT) performed better than the genomic context methods, and the i2h method outperformed all other methods for each dataset, with MT being the next best method for each dataset.
With the exception of the PP, the best performance was on the NP+/P- dataset. This was expected because predicting orphan pairs is usually more challenging than neighboring pairs. PP however rely on comparison across genomes where interacting pairs either appear or disappear concurrently, hence genomic context does not play a unique contribution. Consequently, PP achieved the best performance on the OP+/P- dataset of the genomic context methods. It also performed similarly to the GF method on the P+/P- dataset. In the case of GN and GO, intrinsic limitations of these methodologies, i.e. rely on genomic distance, prevented its use on the P- and OP+ dataset, hence AUC and MCC could not be calculated, hence not presented in Table 1. Nonetheless, GN and GO are valid strategies in the prediction of pairing in neighboring genes (51 pairs out of 57 on the NP+ were predicted correctly), hence GH and GO predictions were considered as part input vector for the meta-predictor (see next).
Contribution of individual methods to MetaPred2CS
Combinations of prediction methods and prediction performance at 10-fold cross-validation. 1: i2h not included, 2: MT not included, 3: GF not included, 4: PP not included, 5: GN not included, 6: GO not included, 7: GN and GO not included, 8: all methods included. AUC and MCC represent the area under the ROC curved and Matthew’s correlation coefficient respectively
1 : i2h method excluded
2 : MT method excluded
3 : GF method excluded
4 : PP method excluded
5 : GN method excluded
6 : GO method excluded
7 : GN/GO methods excluded
8 : all methods included
Performance of default predictor on species-specific gene sets. Sensitivity, specificity, accuracy and MCC values are presented, as defined in the text
Species used as test data
Performance of Classifier
Escherichia coli K-12 MG1655
Myxococcus xanthus DK1622
Synechocystis sp. PCC6803
Mesorhizobium loti MAFF303099
Predictions of neighbouring and orphan pairs (NP+/P- and OP+/P- sets)
Prediction performance of default predictor on neighbouring and orphan pairs. AUC and MCC values for MetaPred2CS trained on the NP+/P- and OP+/P- datasets at different level of K-fold cross-validation
Performance of Classifier According to Cross-validation Levels
Comparison of MetaPred2CS and a competing machine-learning method and STRING database
Prediction of the T dataset by the Bayesian approach  and MetaPred2CSc. Non-interacting protein pairs are marked with an asterisk and best predictions are highlighted in bold
CC0248 - CC0247
CC0289 - CC0294
CC2755 - CC2757
CC2765 - CC2766
CC2932 - CC2931
CenK - CenR
CckN - DivK
ChpT - CtrA
ChpT - CpdR
DivJ - CtrA
DivJ - PleD
DivJ - DivK
DivL - DivK
DivL - CtrA
PleC - DivK
PleC - PleD
ChpT - CC3477*
ChpT - CC2757*
ChpT - CenR*
PleC - CtrA*
PleC - CC3477*
In this work we present a novel sequence-based prediction method designed specifically for TCS signalling networks: MetaPred2CS. The method was systematically assessed under different benchmarking scenarios, and performed well in all conditions, including using species-specific gene sets and TCS with different genome architecture features (i.e. neighbouring proteins vs. orphans). We show that integration of individual prediction methodologies improves the performance of the predictions, and that MetaPred2CS prediction performance compared favourably to existing methodologies. MetaPred2CS is accessible through a dedicated web-server at http://metapred2cs.ibers.aber.ac.uk.
We thank the P2CS team for help extracting information from the P2CS database. We also thank all the authors of the methods and data used in this work for making them freely available to the scientific community. Authors thank anonymous reviewers for constructive and insightful comments and suggestions to this work.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Whitworth DE. Two-component regulatory systems in prokaryotes. In: Filloux A, editor. Bacterial Regulatory Networks. Norfolk: Horizon Scientific Press; 2012. p. 191–222.Google Scholar
- Appleby JL, Parkinson JS, Bourret RB. Signal transduction via the multi-step phosphorelay: not necessarily a road less traveled. Cell. 1996;86:845–8.View ArticlePubMedGoogle Scholar
- Whitworth DE. Classification and organization of two-component systems. In: Gross R, Beier D, editors. Two-component Systems in Bacteria. Norfolk: Horizon Scientific Press; 2012. p. 1–20.Google Scholar
- Ortet P, Whitworth DE, Santaella C, Achouak W, Barakat M. P2CS: updates of the prokaryotic two-component systems database. Nucleic Acids Res. 2015;43:D536–41. doi:10.1093/nar/gku968.View ArticlePubMedGoogle Scholar
- Barakat M, Ortet P, Whitworth DE. P2RP: a Web-based framework for the identification and analysis of regulatory proteins in prokaryotic genomes. BMC Genomics. 2013;14:269. doi:10.1186/1471-2164-14-269.View ArticlePubMedPubMed CentralGoogle Scholar
- Ulrich LE, Zhulin IB. The MiST2 database: a comprehensive genomics resource on microbial signal transduction. Nucleic Acids Res. 2010;38:D401–7. doi:10.1093/nar/gkp940.View ArticlePubMedGoogle Scholar
- Laub MT, Goulian M. Specificity in two-component signal transduction pathways. Annu Rev Genet. 2007;41:121–45. doi:10.1146/annurev.genet.41.042007.170548.View ArticlePubMedGoogle Scholar
- Willett JW, Tiwari N, Müller S, Hummels KR, Houtman JCD, Fuentes EJ, et al. Specificity residues determine binding affinity for two-component signal transduction systems. mBio. 2013;4:e00420–00413. doi:10.1128/mBio.00420-13.View ArticlePubMedPubMed CentralGoogle Scholar
- Laub MT, Biondi EG, Skerker JM. Phosphotransfer profiling: systematic mapping of two-component signal transduction pathways and phosphorelays. Methods Enzymol. 2007;423:531–48. doi:10.1016/S0076-6879(07)23026-5.View ArticlePubMedGoogle Scholar
- Skerker JM, Prasol MS, Perchuk BS, Biondi EG, Laub MT. Two-component signal transduction pathways regulating growth and cell cycle progression in a bacterium: a system-level analysis. PLoS Biol. 2005;3:e334. doi:10.1371/journal.pbio.0030334.View ArticlePubMedPubMed CentralGoogle Scholar
- Lee H-N, Jung K-E, Ko I-J, Baik HS, Oh J-I. Protein-protein interactions between histidine kinases and response regulators of Mycobacterium tuberculosis H37Rv. J Microbiol Seoul Korea. 2012;50:270–7. doi:10.1007/s12275-012-2050-4.Google Scholar
- Sato S, Shimoda Y, Muraki A, Kohara M, Nakamura Y, Tabata S. A large-scale protein protein interaction analysis in Synechocystis sp. PCC6803. DNA Res Int J Rapid Publ Rep Genes Genomes. 2007;14:207–16. doi:10.1093/dnares/dsm021.Google Scholar
- Shimoda Y, Shinpo S, Kohara M, Nakamura Y, Tabata S, Sato S. A large scale analysis of protein-protein interactions in the nitrogen-fixing bacterium Mesorhizobium loti. DNA Res Int J Rapid Publ Rep Genes Genomes. 2008;15:13–23. doi:10.1093/dnares/dsm028.Google Scholar
- Whitworth DE, Millard A, Hodgson DA, Hawkins PF. Protein-protein interactions between two-component system transmitter and receiver domains of Myxococcus xanthus. Proteomics. 2008;8:1839–42. doi:10.1002/pmic.200700544.View ArticlePubMedGoogle Scholar
- Friedberg I, Harder T, Godzik A. JAFA: a protein function annotation meta-server. Nucleic Acids Res. 2006;34:W379–81. doi:10.1093/nar/gkl045.View ArticlePubMedPubMed CentralGoogle Scholar
- Ishida T, Kinoshita K. Prediction of disordered regions in proteins based on the meta approach. Bioinforma Oxf Engl. 2008;24:1344–8. doi:10.1093/bioinformatics/btn195.View ArticleGoogle Scholar
- Kurowski MA, Bujnicki JM. GeneSilico protein structure prediction meta-server. Nucleic Acids Res. 2003;31:3305–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Pawlowski M, Gajda MJ, Matlak R, Bujnicki JM. MetaMQAP: a meta-server for the quality assessment of protein models. BMC Bioinformatics. 2008;9:403. doi:10.1186/1471-2105-9-403.View ArticlePubMedPubMed CentralGoogle Scholar
- Saini HK, Fischer D. Meta-DP: domain prediction meta-server. Bioinforma Oxf Engl. 2005;21:2917–20. doi:10.1093/bioinformatics/bti445.View ArticleGoogle Scholar
- Xue B, Dunbrack RL, Williams RW, Dunker AK, Uversky VN. PONDR-FIT: a meta-predictor of intrinsically disordered amino acids. Biochim Biophys Acta. 1804;2010:996–1010. doi:10.1016/j.bbapap.2010.01.011.Google Scholar
- Schlessinger A, Punta M, Yachdav G, Kajan L, Rost B. Improved disorder prediction by combination of orthogonal approaches. PloS One. 2009;4:e4433. doi:10.1371/journal.pone.0004433.View ArticlePubMedPubMed CentralGoogle Scholar
- Needham CJ, Bradford JR, Bulpitt AJ, Westhead DR. Inference in Bayesian networks. Nat Biotechnol. 2006;24:51–3. doi:10.1038/nbt0106-51.View ArticlePubMedGoogle Scholar
- Segura J, Jones PF, Fernandez-Fuentes N. A holistic in silico approach to predict functional sites in protein structures. Bioinforma Oxf Engl. 2012;28:1845–50. doi:10.1093/bioinformatics/bts269.View ArticleGoogle Scholar
- Assi SA, Tanaka T, Rabbitts TH, Fernandez-Fuentes N. PCRPi: Presaging Critical Residues in Protein interfaces, a new computational tool to chart hot spots in protein interfaces. Nucleic Acids Res. 2010;38:e86. doi:10.1093/nar/gkp1158.View ArticlePubMedGoogle Scholar
- Noble WS. What is a support vector machine? Nat Biotechnol. 2006;24:1565–7. doi:10.1038/nbt1206-1565.View ArticlePubMedGoogle Scholar
- Yang ZR. Biological applications of support vector machines. Brief Bioinform. 2004;5:328–38.View ArticlePubMedGoogle Scholar
- Pazos F, Valencia A. In silico two-hybrid system for the selection of physically interacting protein pairs. Proteins. 2002;47:219–27.View ArticlePubMedGoogle Scholar
- Pazos F, Valencia A. Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng. 2001;14:609–14.View ArticlePubMedGoogle Scholar
- Sun J, Xu J, Liu Z, Liu Q, Zhao A, Shi T, et al. Refined phylogenetic profiles method for predicting protein-protein interactions. Bioinforma Oxf Engl. 2005;21:3409–15. doi:10.1093/bioinformatics/bti532.View ArticleGoogle Scholar
- Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA. Protein interaction maps for complete genomes based on gene fusion events. Nature. 1999;402:86–90. doi:10.1038/47056.View ArticlePubMedGoogle Scholar
- Shoemaker BA, Panchenko AR. Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Comput Biol. 2007;3:e43. doi:10.1371/journal.pcbi.0030043.View ArticlePubMedPubMed CentralGoogle Scholar
- Burger L, van Nimwegen E. Accurate prediction of protein-protein interactions from sequence alignments using a Bayesian method. Mol Syst Biol. 2008;4:165. doi:10.1038/msb4100203.View ArticlePubMedPubMed CentralGoogle Scholar
- Von Mering C, Huynen M, Jaeggi D, Schmidt S, Bork P, Snel B. STRING: a database of predicted functional associations between proteins. Nucleic Acids Res. 2003;31:258–61.View ArticleGoogle Scholar
- Stark C, Breitkreutz B-J, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34:D535–9. doi:10.1093/nar/gkj109.View ArticlePubMedGoogle Scholar
- Xenarios I, Salwínski L, Duan XJ, Higney P, Kim S-M, Eisenberg D. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002;30:303–5.View ArticlePubMedPubMed CentralGoogle Scholar
- Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, et al. IntAct: an open source molecular interaction database. Nucleic Acids Res. 2004;32:D452–5. doi:10.1093/nar/gkh052.View ArticlePubMedPubMed CentralGoogle Scholar
- Hermjakob H, Montecchi-Palazzi L, Bader G, Wojcik J, Salwinski L, Ceol A, et al. The HUPO PSI’s molecular interaction format--a community standard for the representation of protein interaction data. Nat Biotechnol. 2004;22:177–83. doi:10.1038/nbt926.View ArticlePubMedGoogle Scholar
- Magrane M, Consortium U. UniProt Knowledgebase: a hub of integrated protein data. Database J Biol Databases Curation. 2011;2011:bar009. doi:10.1093/database/bar009.Google Scholar
- Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M, Cesareni G. MINT: a Molecular INTeraction database. FEBS Lett. 2002;513:135–40.View ArticlePubMedGoogle Scholar
- Cock PJA, Whitworth DE. Evolution of gene overlaps: relative reading frame bias in prokaryotic two-component system genes. J Mol Evol. 2007;64:457–62. doi:10.1007/s00239-006-0180-1.View ArticlePubMedGoogle Scholar
- Cock PJA, Whitworth DE. Evolution of prokaryotic two-component system signaling pathways: gene fusions and fissions. Mol Biol Evol. 2007;24:2355–7. doi:10.1093/molbev/msm170.View ArticlePubMedGoogle Scholar
- Williams RHN, Whitworth DE. The genetic organisation of prokaryotic two-component system signalling pathways. BMC Genomics. 2010;11:720. doi:10.1186/1471-2164-11-720.View ArticlePubMedPubMed CentralGoogle Scholar
- Muley VY, Ranjan A. Effect of reference genome selection on the performance of computational methods for genome-wide protein-protein interaction prediction. PloS One. 2012;7:e42057. doi:10.1371/journal.pone.0042057.View ArticlePubMedPubMed CentralGoogle Scholar
- Tatusova T, Ciufo S, Fedorov B, O’Neill K. Tolstoy I. About Prokaryotic Genome Processing and Tools: Zaslavsky L; 2014.Google Scholar
- Moreno-Hagelsieb G, Collado-Vides J. A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics. 2002;18:S329–36. doi:10.1093/bioinformatics/18.suppl_1.S329.View ArticlePubMedGoogle Scholar
- Fernandez-Fuentes N, Rai BK, Madrid-Aliste CJ, Fajardo JE, Fiser A. Comparative protein structure modeling by combining multiple templates and optimizing sequence-to-structure alignments. Bioinforma Oxf Engl. 2007;23:2558–65. doi:10.1093/bioinformatics/btm377.View ArticleGoogle Scholar
- Sun J, Sun Y, Ding G, Liu Q, Wang C, He Y, et al. InPrePPI: an integrated evaluation method based on genomic context for predicting protein-protein interactions in prokaryotic genomes. BMC Bioinformatics. 2007;8:414. doi:10.1186/1471-2105-8-414.View ArticlePubMedPubMed CentralGoogle Scholar
- Bhagwat M, Aravind L. PSI-BLAST tutorial. Methods Mol Biol Clifton NJ. 2007;395:177–86.View ArticleGoogle Scholar
- Strong M, Mallick P, Pellegrini M, Thompson MJ, Eisenberg D. Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: a combined computational approach. Genome Biol. 2003;4:R59. doi:10.1186/gb-2003-4-9-r59.View ArticlePubMedPubMed CentralGoogle Scholar
- Ermolaeva MD, White O, Salzberg SL. Prediction of operons in microbial genomes. Nucleic Acids Res. 2001;29:1216–21.View ArticlePubMedPubMed CentralGoogle Scholar
- Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147:195–7.View ArticlePubMedGoogle Scholar
- Pearson WR. Flexible Sequence Similarity Searching with the FASTA3 Program Package. In: Misener S, Krawetz SA, editors. Bioinformatics Methods and Protocols. New Jersey: Springer Science & Business Media; 1999. p. 185–216.View ArticleGoogle Scholar
- Chang C-C, Lin C-J. LIBSVM: A Library for Support Vector Machines. ACM Trans Intell Syst Technol. 2011;2:27:1–27. doi:10.1145/1961189.1961199. 27.View ArticleGoogle Scholar
- Cho BH, Yu H, Lee J, Chee YJ, Kim IY, Kim SI. Nonlinear support vector machine visualization for risk factor analysis using nomograms and localized radial basis function kernels. IEEE Trans Inf Technol Biomed Publ IEEE Eng Med Biol Soc. 2008;12:247–56. doi:10.1109/TITB.2007.902300.View ArticleGoogle Scholar
- Baldi P, Brunak S, Chauvin Y, Andersen CAF, Nielsen H. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics. 2000;16:412–24. doi:10.1093/bioinformatics/16.5.412.View ArticlePubMedGoogle Scholar
- Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem. 1993;39:561–77.PubMedGoogle Scholar
- Vergara IA, Norambuena T, Ferrada E, Slater AW, Melo F. StAR: a simple tool for the statistical comparison of ROC curves. BMC Bioinformatics. 2008;9:265. doi:10.1186/1471-2105-9-265.View ArticlePubMedPubMed CentralGoogle Scholar
- Zhang QC, Petrey D, Deng L, Qiang L, Shi Y, Thu CA, et al. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature. 2012;490:556–60. doi:10.1038/nature11503.View ArticlePubMedPubMed CentralGoogle Scholar
- Planas-Iglesias J, Bonet J, García-García J, Marín-López MA, Feliu E, Oliva B. Understanding protein-protein interactions using local structural features. J Mol Biol. 2013;425:1210–24. doi:10.1016/j.jmb.2013.01.014.View ArticlePubMedGoogle Scholar