Volume 13 Supplement 7
Inferring a protein interaction map of Mycobacterium tuberculosis based on sequences and interologs
© Liu et al.; licensee BioMed Central Ltd. 2012
Published: 8 May 2012
Mycobacterium tuberculosis is an infectious bacterium posing serious threats to human health. Due to the difficulty in performing molecular biology experiments to detect protein interactions, reconstruction of a protein interaction map of M. tuberculosis by computational methods will provide crucial information to understand the biological processes in the pathogenic microorganism, as well as provide the framework upon which new therapeutic approaches can be developed.
In this paper, we constructed an integrated M. tuberculosis protein interaction network by machine learning and ortholog-based methods. Firstly, we built a support vector machine (SVM) method to infer the protein interactions of M. tuberculosis H37Rv by gene sequence information. We tested our predictors in Escherichia coli and mapped the genetic codon features underlying its protein interactions to M. tuberculosis. Moreover, the documented interactions of 14 other species were mapped to the interactome of M. tuberculosis by the interolog method. The ensemble protein interactions were validated by various functional relationships, i.e., gene coexpression, evolutionary relationship and functional similarity, extracted from heterogeneous data sources. The accuracy and validation demonstrate the effectiveness and efficiency of our framework.
A protein interaction map of M. tuberculosis is inferred from genetic codons and interologs. The prediction accuracy and the computational validation demonstrate the effectiveness and efficiency of our method. Furthermore, our methods can be straightforwardly extended to infer the protein interactions of other bacterial species.
M. tuberculosis, which causes tuberculosis affecting the lungs and other organs, is the second largest cause of death from infectious diseases. An extensive protein-protein interaction (PPI) network of M. tuberculosis can lead to more comprehensive screens of cellular operations. In this context, developing approaches to infer its interactome will contribute to identifying infectious mechanisms, detecting important drug target proteins and promoting potential therapy innovations. To date, genome-wide experimental and computational systems for studying PPIs in M. tuberculosis are unavailable. It is therefore necessary to develop approaches capable of converting available genomic data into a functional protein-interaction map for M. tuberculosis. E. coli is one of the best model systems for studying bacterial physiology, with a relatively well-characterized interactome, genome and transcriptome. Protein interactions are believed to be conserved across different organisms. The interaction features can be learned by machine learning methods, such as support vector machines (SVMs) [6, 7], and it is also common to predict protein interactions from the known interactions of other organisms by the interolog method.
Compared with other methods, sequence-based prediction methods have the advantage of a simple data requirement: they can be applied to any species with a completely sequenced genome. Sequence-based studies have successfully predicted PPIs in model organisms such as H. sapiens, S. cerevisiae and E. coli [6, 9–12]. However, a limitation of these methods is that they require a large training set to reach satisfactory accuracy. For model organisms, a large volume of known PPIs can be used as training data, but few experimental PPI data exist for dangerous bacteria such as M. tuberculosis. Thus, a novel integration method needs to be developed. In this work, we provide cross-species PPI predictions in M. tuberculosis by integrating different types of protein interaction information from other species. Genetic information in the form of codons, i.e. tri-nucleotide sequences, is translated into proteins. It is well known that codon usage is correlated with expression level [9, 13]. A codon carries genetic information and specifies an amino acid in the polypeptide during protein synthesis. The coding sequence is therefore not only the blueprint for translating amino acids, but also the original information underlying the transcription of gene expression. Here, genetic codons are selected as the sequence features for learning interaction patterns. Moreover, the orthologs of interacting proteins in other organisms provide further information about potential interaction mappings by comparative genomics.
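To make the idea of codon-based sequence features concrete, the following minimal Python sketch turns a coding sequence into a vector of 64 relative codon frequencies and joins the two vectors of a protein pair. The concatenated 128-dimensional encoding and the function names `codon_frequencies` and `pair_features` are assumptions made for illustration; the exact feature encoding used in this work is not given in the text above.

```python
from itertools import product

# All 64 codons in a fixed order so every feature vector has the same layout.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(cds):
    """Return the 64 relative codon frequencies of one coding sequence."""
    cds = cds.upper().replace("U", "T")
    counts = dict.fromkeys(CODONS, 0)
    total = 0
    # Read the sequence in frame, three nucleotides at a time.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon in counts:  # skip codons with ambiguous bases such as N
            counts[codon] += 1
            total += 1
    return [counts[c] / total if total else 0.0 for c in CODONS]


def pair_features(cds_a, cds_b):
    """Concatenate two codon profiles into one 128-dimensional vector.

    Simple concatenation is an assumed encoding, not the paper's stated one.
    """
    return codon_frequencies(cds_a) + codon_frequencies(cds_b)
```

A vector built this way depends on the order of the two genes; a symmetric alternative (for example, summing the two profiles) would be another reasonable choice.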
In this work, we developed a systematic method combining heterogeneous data sources to infer a comprehensive protein interaction map for M. tuberculosis. The codon features of interacting protein pairs are detected and used to train an SVM classifier. Then the interactome of M. tuberculosis is predicted by the codon-based method. Moreover, the interactions from 14 other species are mapped to M. tuberculosis by the interolog method. The available data from multiple levels, including gene coexpression, evolutionary relationship and functional similarity, are used to assess the confidence of these predicted interactions. The evidence from various sources validates the effectiveness of our method. The network properties of the constructed protein-interaction map are also identified. The predicted protein interaction network, together with the proposed method, provides a framework for studying the functional specificities of M. tuberculosis.
Prediction performances of the codon-based SVM predictor in E. coli
Protein interactions in M. tuberculosis
Details of predicted protein interactions in M. tuberculosis
By machine learning
Total: 46,119 interactions in 3,465 proteins (with 530 known PPIs)
Interacting protein pairs have been found to be closely related in terms of gene coexpression, coevolution, similar GO annotations, phenotype association and similar physicochemical properties. For M. tuberculosis, we collected the available heterogeneous data sources to annotate every predicted interacting pair.
Table: Topological parameters of the protein-interaction map in M. tuberculosis (listed parameters include characteristic path length and power law fitting)
0.5 + 0.5 + 0.3 (GOCC + GOMF + GOBP)
0.5 + 0.7 (PCC + COG)
0.5 + 0.9 + 0.8 (PCC + COG + GOCC)
0.5 + 0.7 + 0.4 (PCC + GOMF + GOBP)
0.5 + 0.7 (COG + GOBP)
In this work, we proposed a method to build the protein-interaction map of M. tuberculosis by machine learning and interologs. We extracted genetic codon features of interacting proteins from the relatively well-established interactome of E. coli. By training an SVM classifier, these features were mapped to the proteome of M. tuberculosis. The cross validation showed the effectiveness and efficiency of our predictor. We also implemented the interolog method to map the documented protein interactions of other organisms onto M. tuberculosis. Moreover, the available functional genomic information about M. tuberculosis was used to evaluate the predicted interactions. These heterogeneous data were combined in a novel framework to infer the interactions in M. tuberculosis. The predicted pairs were checked against this information and can be filtered with it for potential applications. The constructed protein interaction network of M. tuberculosis provides more information about this infectious bacterium, which threatens human health.
We used multiple sources of available functional genomic data to evaluate the predicted interactions. Gene coexpression, evolutionary relationship and functional similarity were used to check the reliability of the predicted pairs. Such information could be used directly to build functional relationships between protein pairs [23–25]. However, due to the limited knowledge about M. tuberculosis, we integrated the heterogeneous information in an alternative framework that assesses the predictions rather than generating them. Filtering interactions by different confidence values results in networks of different size and reliability. These networks provide a valuable resource for tuberculosis research, and applications based on our constructed protein interaction map are topics of our future research.
Prediction performances of the SVM predictor in these known protein interactions of M. tuberculosis
We implemented two pipelines for building the protein-interaction map of M. tuberculosis, i.e., the SVM-based machine learning method and the interolog mapping method. The two methods are closely related. In the first, the predictor learns the gene sequence information of interacting protein pairs, and the information from these known interactions is mapped to the protein pairs of M. tuberculosis. In the same manner, the interolog method identifies an interaction between a pair of proteins that have interacting homologs in another organism; here the protein sequence information of known interactions is mapped by cross-species sequence similarity detection. Our predictions thus integrate several mapping schemes, exploiting both gene sequence information and protein sequence information to infer the protein-interaction map of M. tuberculosis. Identifying the quantitative relationship between the prediction results of the two methods, and investigating how the predictions differ between the two levels of sequence information, are interesting research topics. Another research direction is to implement other schemes for encoding the sequence information in the machine learning method, such as the autocorrelation encoding scheme and the triplet residues method.
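The interolog principle described here can be sketched in a few lines of Python: an interaction is transferred to the target species when both partners have orthologs there. The ortholog table is assumed to be given (for example, derived from sequence similarity searches), and the function name `map_interologs` and all protein identifiers are hypothetical placeholders, not data from this work.

```python
def map_interologs(source_interactions, orthologs):
    """Transfer interactions from a source species to a target species.

    source_interactions: iterable of (protein_a, protein_b) pairs in the source species.
    orthologs: dict mapping a source protein to a set of target-species orthologs.
    """
    predicted = set()
    for a, b in source_interactions:
        # Proteins without any ortholog contribute no predictions.
        for ta in orthologs.get(a, ()):
            for tb in orthologs.get(b, ()):
                if ta != tb:  # ignore self-interactions in this sketch
                    # Store pairs in sorted order so (x, y) and (y, x) coincide.
                    predicted.add(tuple(sorted((ta, tb))))
    return predicted


# Hypothetical identifiers, for illustration only.
orthologs = {"ecA": {"mtb1"}, "ecB": {"mtb2", "mtb3"}}
pairs = map_interologs([("ecA", "ecB")], orthologs)
```

When a source protein has several orthologs, this sketch predicts every combination; a stricter variant could keep only the best-scoring ortholog of each partner.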
In conclusion, we provided a novel framework that integrates genomic data to infer a protein interaction map of M. tuberculosis. We predicted protein interactions in M. tuberculosis with an SVM-based classifier trained on genetic codons. The documented protein interactions from various species were also mapped to the proteome of M. tuberculosis by the interolog method. Information from gene expression, evolutionary relationship and functional relationship provided reliability measures for evaluating our predictions. The validations provided clear evidence for the effectiveness of our method. Our framework can easily be extended to infer large-scale protein interaction maps in other species. The predicted interactions provide a valuable interactome reference for M. tuberculosis research and a basis for further study of its functional implications. They are listed in Additional file 2. The details are available at: http://www.aporc.org/doc/wiki/MTBPPI.
Framework of prediction
We used the SVM method as the classifier. The software libsvm 2.84 was employed, and a radial basis function was chosen as the kernel function in our implementation. The positive training pairs are the known interactions that are experimentally validated in EcID; there were 14,058 such pairs. We built the negative set from protein pairs whose shortest path length in the EcID network exceeds a cutoff of 6, a choice motivated by the small-world property of complex networks. Two proteins separated by more than this distance are unlikely to interact with each other. The negative set contained 27,882 protein pairs. Five-fold cross validation was used to test the accuracy of our SVM-based classifier. We then applied the trained predictor to infer the protein interactions in M. tuberculosis.
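The negative-set rule above (shortest path length larger than a cutoff of 6) can be illustrated with a breadth-first search over the interaction network. This is a minimal sketch under stated assumptions: the network is given as a list of undirected edges, disconnected pairs are skipped (the text does not say how they were treated), and the function names are hypothetical. The all-pairs search is only practical for small networks.

```python
from collections import deque
from itertools import combinations


def shortest_path_lengths(adjacency, source):
    """Breadth-first search returning hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adjacency.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def negative_pairs(edges, cutoff=6):
    """Return protein pairs whose shortest path in the network is longer than cutoff."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    dists = {n: shortest_path_lengths(adjacency, n) for n in adjacency}
    negatives = []
    for a, b in combinations(sorted(adjacency), 2):
        d = dists[a].get(b)  # None means the two proteins are disconnected
        # Assumption of this sketch: disconnected pairs are not used as negatives.
        if d is not None and d > cutoff:
            negatives.append((a, b))
    return negatives
```

For a chain of nine proteins p0 to p8, only the pairs more than six steps apart, such as (p0, p7), qualify as negatives.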
where TP, TN, FP and FN refer to number of true positive, number of true negative, number of false positive, and number of false negative predictions, respectively.
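The equations this sentence refers to are not present in this copy of the text. As a hedged illustration, the Python sketch below computes measures that are commonly derived from these four counts (sensitivity, specificity, precision, accuracy and the Matthews correlation coefficient); which of them the authors reported is an assumption, and the function name `classification_metrics` is hypothetical.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute standard performance measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    # Each ratio falls back to 0.0 when its denominator is zero.
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Arbitrary illustrative counts, not results from this study.
metrics = classification_metrics(tp=8, tn=9, fp=1, fn=2)
```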
Validation from multiple resources
We constructed the protein interaction map of M. tuberculosis by genetic codons and ortholog mapping. We also collected the known M. tuberculosis interactions that are deposited in databases or reported from experimental results in the literature. Integrating these known protein interactions, we built a comprehensive PPI map of M. tuberculosis. We collected multiple available resources to assess the constructed protein interaction map. The confidence of interactions was evaluated by three additional data sources, namely, gene expression, evolutionary relationship and functional similarity.
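As one example of such evidence, gene coexpression between the two partners of a predicted interaction is often quantified with the Pearson correlation coefficient of their expression profiles. The sketch below assumes that this is the measure meant by coexpression here and that each gene is represented by a list of expression values over the same conditions; the function name `pearson` and the example values are illustrative only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no coexpression signal
    return cov / math.sqrt(var_x * var_y)


# Made-up expression values for two genes across four conditions.
score = pearson([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 8.0])
```

A pair could then be flagged as coexpressed when the coefficient exceeds a chosen threshold; the threshold itself is a free parameter of this sketch.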
where anno(c) is the number of proteins annotated with term c in our database, and children(c) is the set of child nodes of term c. The probability of a term c is then defined as p(c) = freq(c)/freq(root), where freq(root) is the frequency of the root term [37, 38]. We used semantic similarity measures [36–38] to evaluate the similarity of the GO term lists corresponding to the interacting proteins. Based on these validations, we can identify the interactions that are consistently supported by the various sources of information and derive an ensemble protein network by omitting low-reliability pairs.
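The term probability defined above can be sketched in Python as follows. The sketch assumes the usual recursive definition freq(c) = anno(c) + sum of freq(h) over all h in children(c), which matches the description of anno(c) and children(c) but is not shown explicitly in this text. The toy ontology, the annotation counts and the function names are hypothetical. Information content, -log p(c), is included because semantic similarity measures are typically built on it.

```python
import math
from functools import lru_cache


def term_probabilities(children, anno, root):
    """Compute p(c) = freq(c) / freq(root) for every term of a GO-like hierarchy.

    children: dict mapping a term to a list of its child terms.
    anno: dict mapping a term to the number of proteins annotated with it.
    """
    @lru_cache(maxsize=None)
    def freq(term):
        # Assumed recursion: own annotations plus the frequencies of all children.
        return anno.get(term, 0) + sum(freq(h) for h in children.get(term, ()))

    terms = set(children) | set(anno) | {h for hs in children.values() for h in hs}
    root_freq = freq(root)
    if root_freq == 0:
        raise ValueError("the root term has no annotated proteins below it")
    return {t: freq(t) / root_freq for t in terms}


def information_content(p):
    """Information content of a term with probability p (requires p > 0)."""
    return -math.log(p)


# Toy hierarchy with made-up annotation counts.
children = {"root": ["a", "b"], "a": ["c"]}
anno = {"a": 2, "b": 3, "c": 5}
probs = term_probabilities(children, anno, "root")
```

In this toy case the root has probability 1 and therefore zero information content, while more specific terms receive lower probabilities and higher information content.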
SVM: support vector machine
ROC: receiver operating characteristic
AUC: area under curve
BIND: biomolecular interaction network database
BioGRID: biological general repository for interaction datasets
MINT: molecular interaction database
ORF: open reading frame
COG: cluster of orthologous group
NCBI: national center for biotechnology information
GEO: gene expression omnibus.
This work was supported by the Shanghai Natural Science Foundation under Grant No. 11ZR1443100, by the Chief Scientist Program of Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS) under Grant No. 2009CSP002, and by the Knowledge Innovation Program of SIBS of CAS under Grants No. 2011KIP203 and No. KSCX2-EW-R-01. The first author gratefully acknowledges the support of the SA-SIBS Scholarship Program. Some of the authors were also supported by the National Natural Science Foundation of China (NSFC) under Grants No. 31100949, 61072149 and 91029301, and partially supported by the Aihara Project, the FIRST program from JSPS, initiated by CSTP.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 7, 2012: Advanced intelligent computing theories and their applications in bioinformatics. Proceedings of the 2011 International Conference on Intelligent Computing (ICIC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S7.
1. Reddy T, Riley R, Wymore F, Montgomery P, DeCaprio D, Engels R, Gellesch M, Hubble J, Jen D, Jin H, et al.: TB database: an integrated platform for tuberculosis research. Nucleic Acids Res 2009, 37: D499-D508. doi:10.1093/nar/gkn652
2. Singh A, Mai D, Kumar A, Steyn A: Dissecting virulence pathways of Mycobacterium tuberculosis through protein-protein association. Proc Natl Acad Sci USA 2006, 103: 11346–11351. doi:10.1073/pnas.0602817103
3. Arifuzzaman M, Maeda M, Itoh A, Nishikata K, Takita C, Saito R, Ara T, Nakahigashi K, Huang H, Hirai A, et al.: Large-scale identification of protein-protein interaction of Escherichia coli K-12. Genome Res 2006, 16: 686–691. doi:10.1101/gr.4527806
4. Andres Leon E, Ezkurdia I, García B, Valencia A, Juan D: EcID. A database for the inference of functional interactions in E. coli. Nucleic Acids Res 2009, 37: D629-D635. doi:10.1093/nar/gkn853
5. Hirsh E, Sharan R: Identification of conserved protein complexes based on a model of protein network evolution. Bioinformatics 2007, 23: e170-e176. doi:10.1093/bioinformatics/btl295
6. Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, Li Y, Jiang H: Predicting protein-protein interactions based only on sequences information. Proc Natl Acad Sci USA 2007, 104: 4337–4341. doi:10.1073/pnas.0607879104
7. Wang Y, Wang J, Yang Z, Deng N: Sequence-based protein-protein interaction prediction via support vector machine. J Syst Sci & Complexity 2010, 23: 1012–1023.
8. Yu H, Luscombe N, Lu H, Zhu X, Xia Y, Han J, Bertin N, Chung S, Vidal M, Gerstein M: Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res 2004, 14: 1107–1118. doi:10.1101/gr.1774904
9. Najafabadi H, Salavati R: Sequence-based prediction of protein-protein interactions by means of codon usage. Genome Biol 2008, 9: R87. doi:10.1186/gb-2008-9-5-r87
10. Xia J, Zhao X, Huang D: Predicting protein-protein interactions from protein sequences using meta predictor. Amino Acids 2010, 39: 1595–1599. doi:10.1007/s00726-010-0588-1
11. You Z, Lei Y, Huang D, Zhou X: Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data. Bioinformatics 2010, 26: 2744–2751. doi:10.1093/bioinformatics/btq510
12. Shi M, Xia J, Li X, Huang D: Predicting protein-protein interactions from sequence using correlation coefficient and high-quality interaction dataset. Amino Acids 2010, 38: 891–899. doi:10.1007/s00726-009-0295-y
13. Jansen R, Bussemaker H, Gerstein M: Revisiting the codon adaptation index from a whole-genome perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models. Nucleic Acids Res 2003, 31: 2242–2251. doi:10.1093/nar/gkg306
14. Cole S, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, Gordon S, Eiglmeier K, Gas S, Barry C, et al.: Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 1998, 393: 537–544. doi:10.1038/31159
15. Alfarano C, Andrade C, Anthony K, Bahroos N, Bajec M, Bantoft K, Betel D, Bobechko B, Boutilier K, Burgess E, et al.: The Biomolecular Interaction Network Database and related tools - 2005 update. Nucleic Acids Res 2005, 33: D418-D424.
16. Vastrik I, D'Eustachio P, Schmidt E, Joshi-Tope G, Gopinath G, Croft D, de Bono B, Gillespie M, Jassal B, Lewis S, Matthews L, Wu G, Birney E, Stein L: Reactome: a knowledge base of biologic pathways and processes. Genome Biol 2007, 8: R39. doi:10.1186/gb-2007-8-3-r39
17. Jansen R, Greenbaum D, Gerstein M: Relating whole-genome expression data with protein-protein interactions. Genome Res 2002, 12: 37–46. doi:10.1101/gr.205602
18. Jothi R, Kann M, Przytycka T: Predicting protein-protein interaction by searching evolutionary tree automorphism space. Bioinformatics 2005, 21: i241-i250. doi:10.1093/bioinformatics/bti1009
19. Mahdavi M, Lin Y: False positive reduction in protein-protein interaction predictions using gene ontology annotations. BMC Bioinformatics 2007, 8: 262. doi:10.1186/1471-2105-8-262
20. Chen L, Wu L, Wang Y, Zhang X: Inferring protein interactions from experimental data by association probabilistic method. Proteins 2006, 62: 833–837. doi:10.1002/prot.20783
21. Albert R, Barabasi A: Statistical mechanics of complex networks. Reviews of Modern Physics 2002, 74: 47. doi:10.1103/RevModPhys.74.47
22. Barabasi A, Oltvai Z: Network biology: understanding the cell's functional organization. Nat Rev Genet 2004, 5: 101–113. doi:10.1038/nrg1272
23. Eisenberg D, Marcotte E, Xenarios I, Yeates T: Protein function in the post-genomic era. Nature 2000, 405: 823–826. doi:10.1038/35015694
24. Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan N, Chung S, Emili A, Snyder M, Greenblatt J, Gerstein M: A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 2003, 302: 449–453. doi:10.1126/science.1087361
25. Lee I, Date S, Adai A, Marcotte E: A probabilistic functional network of yeast genes. Science 2004, 306: 1555–1558. doi:10.1126/science.1099511
26. Guo Y, Yu L, Wen Z, Li M: Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res 2008, 36: 3025–3030. doi:10.1093/nar/gkn159
27. Kerrien S, Alam-Faruque Y, Aranda B, et al.: IntAct-open source resource for molecular interaction data. Nucleic Acids Res 2007, 35: D561-D565. doi:10.1093/nar/gkl958
28. Xenarios I, Salwinski L, Duan X, Higney P, Kim S, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res 2002, 30: 303–305. doi:10.1093/nar/30.1.303
29. Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. doi:10.1093/nar/25.17.3389
30. Vapnik V: The Nature of Statistical Learning Theory. New York: Springer-Verlag; 1995.
31. Chang C, Lin C: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2011, 2: 1–27.
32. Barrett T, Troup D, Wilhite S, Ledoux P, Rudnev D, Evangelista C, Kim I, Soboleva A, Tomashevsky M, Edgar R: NCBI GEO: mining tens of millions of expression profiles-database and tools update. Nucleic Acids Res 2007, 35: D760-D765. doi:10.1093/nar/gkl887
33. Tatusov R, Galperin M, Natale D, Koonin E: The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 2000, 28: 33–36. doi:10.1093/nar/28.1.33
34. Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nat Genet 2000, 25: 25–29. doi:10.1038/75556
35. Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, 32: D262-D266. doi:10.1093/nar/gkh021
36. Lord P, Stevens R, Brass A, Goble C: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 2003, 19: 1275–1283. doi:10.1093/bioinformatics/btg153
37. Schlicker A, Domingues F, Rahnenfuhrer J, Lengauer T: A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics 2006, 7: 302. doi:10.1186/1471-2105-7-302
38. Liu ZP, Wu LY, Wang Y, Chen L, Zhang XS: Predicting gene ontology functions from protein's regional surface structures. BMC Bioinformatics 2007, 8: 475. doi:10.1186/1471-2105-8-475
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.