Inferring a protein interaction map of Mycobacterium tuberculosis based on sequences and interologs
© Liu et al.; licensee BioMed Central Ltd. 2012
Published: 8 May 2012
Mycobacterium tuberculosis is an infectious bacterium posing serious threats to human health. Due to the difficulty in performing molecular biology experiments to detect protein interactions, reconstruction of a protein interaction map of M. tuberculosis by computational methods will provide crucial information to understand the biological processes in the pathogenic microorganism, as well as provide the framework upon which new therapeutic approaches can be developed.
In this paper, we constructed an integrated M. tuberculosis protein interaction network by machine learning and ortholog-based methods. First, we built a support vector machine (SVM) predictor to infer the protein interactions of M. tuberculosis H37Rv from gene sequence information. We tested the predictor in Escherichia coli and mapped the genetic codon features underlying its protein interactions to M. tuberculosis. Moreover, the documented interactions of 14 other species were mapped to the interactome of M. tuberculosis by the interolog method. The combined protein interactions were validated by various functional relationships, i.e., gene coexpression, evolutionary relationship and functional similarity, extracted from heterogeneous data sources. The accuracy and validation demonstrate the effectiveness and efficiency of our framework.
We inferred a protein interaction map of M. tuberculosis from genetic codons and interologs. The prediction accuracy and the computational validation demonstrate the effectiveness and efficiency of our method. Furthermore, our method can be straightforwardly extended to infer the protein interactions of other bacterial species.
M. tuberculosis, which causes tuberculosis affecting the lungs and other organs, is the second largest cause of death from infectious diseases. An extensive protein-protein interaction (PPI) network of M. tuberculosis can lead to more comprehensive screens of cellular operations. In this context, approaches for inferring its interactome will contribute to identifying infection mechanisms, detecting important drug target proteins and promoting new therapies. To date, genome-wide experimental and computational systems for studying PPIs in M. tuberculosis are unavailable. It is therefore necessary to develop approaches capable of converting the available genomic data into functional information in the form of a protein-interaction map for M. tuberculosis. E. coli is one of the best model systems for studying bacterial physiology, with a relatively well-characterized interactome, genome and transcriptome. Protein interactions are believed to be conserved across organisms. Interaction features can be learned by machine learning methods such as support vector machines (SVMs) [6, 7], and it is also common to predict protein interactions from the known interactions of other organisms by the interolog method.
Compared with other methods, sequence-based prediction methods have the advantage of simple data requirements: they can be applied to any species with a completely sequenced genome. Sequence-based studies have successfully predicted PPIs in model organisms such as H. sapiens, S. cerevisiae and E. coli [6, 9–12]. However, a limitation of these methods is that they need a large training set to reach satisfactory accuracy. For model organisms, a large volume of known PPIs can be used as training data, but few experimental PPI data exist for dangerous bacteria such as M. tuberculosis. A novel integration method is therefore needed. In this work, we provided cross-species PPI predictions in M. tuberculosis by integrating different types of protein interaction information from other species. Genetic information in the form of codons, i.e. tri-nucleotide sequences, is translated into proteins. It is well known that codon usage is correlated with expression level [9, 13]. The codon, which carries genetic information, specifies the amino acid sequence of the polypeptide during protein synthesis. The coding sequence is thus not only the blueprint for translating amino acids, but also the original information for the transcription underlying gene expression. Here, genetic codons are selected as the sequence features for learning interaction patterns. Moreover, the orthologs of interacting proteins in other organisms provide further information about potential interactions through comparative genomics.
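To make the idea of codon-based sequence features concrete, the following is a minimal sketch, not the encoding used in this work. It assumes that each gene is summarized by the relative frequencies of its 64 codons and that a protein pair is represented by concatenating the two vectors; the function names `codon_frequencies` and `pair_features` are hypothetical.

```python
from itertools import product

# All 64 codons in a fixed order, so every gene maps to a vector with the same layout.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(cds):
    """Return the 64 relative codon frequencies of a coding sequence."""
    cds = cds.upper()
    counts = dict.fromkeys(CODONS, 0)
    total = 0
    # Read the sequence in frame, three nucleotides at a time.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon in counts:  # skip codons containing ambiguous bases such as N
            counts[codon] += 1
            total += 1
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]


def pair_features(cds_a, cds_b):
    """Concatenate the codon frequencies of two genes (assumed pair encoding)."""
    return codon_frequencies(cds_a) + codon_frequencies(cds_b)
```

A vector built this way has 128 entries per protein pair and could be passed to any classifier. Note that the concatenation is order-dependent, so a symmetric encoding may be preferable in practice.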
In this work, we developed a systematic method that combines heterogeneous data sources to infer a comprehensive protein interaction map for M. tuberculosis. The codon features of interacting protein pairs are extracted and used to train an SVM classifier, and the interactome of M. tuberculosis is then predicted by this codon-based method. Moreover, the interactions from 14 other species are mapped to M. tuberculosis by the interolog method. Available data from multiple levels, including gene coexpression, evolutionary relationship and functional similarity, are used to assess the confidence of the predicted interactions. The evidence from these sources validates the effectiveness of our method. The network properties of the constructed protein-interaction map are also characterized. The predicted protein interaction network and the proposed method provide a framework for studying the functional specificities of M. tuberculosis.
Prediction performances of the codon-based SVM predictor in E. coli
Details of predicted protein interactions in M. tuberculosis (by machine learning and in total). Total: 46,119 interactions in 3,465 proteins (with 530 known PPIs)
Interacting protein pairs have been found to be closely related in terms of gene coexpression, coevolution, similar GO annotations, phenotype association and similar physicochemical properties. For M. tuberculosis, we used the available heterogeneous data sources to annotate every predicted interacting pair.
Topological parameters of the protein-interaction map in M. tuberculosis (including characteristic path length and power law fitting)
0.5 + 0.5 + 0.3 (GOCC + GOMF + GOBP)
0.5 + 0.7 (PCC + COG)
0.5 + 0.9 + 0.8 (PCC + COG + GOCC)
0.5 + 0.7 + 0.4 (PCC + GOMF + GOBP)
0.5 + 0.7 (COG + GOBP)
In this work, we proposed a method to build the protein-interaction map of M. tuberculosis by machine learning and interologs. We extracted the genetic codon features underlying interacting proteins from the relatively well-established interactome of E. coli and, by training an SVM classifier on them, mapped these features to the proteome of M. tuberculosis. Cross validation showed the effectiveness and efficiency of our predictor. We also used the interolog method to map the documented protein interactions of other organisms to M. tuberculosis. Moreover, the available functional genomic information about M. tuberculosis was used to evaluate the predicted interactions. These heterogeneous data were combined in a novel framework to infer the interactions in M. tuberculosis. The predicted pairs were checked against this information and can be filtered with it for potential applications. The constructed protein interaction network provides additional information about this infectious bacterium, which threatens human health.
We used multiple sources of available functional genomic data to evaluate the predicted interactions. Gene coexpression, evolutionary relationship and functional similarity were used to check the reliability of the predicted pairs. Such information could also be used directly to build functional relationships between protein pairs [23–25]. Because knowledge of M. tuberculosis is limited, we instead integrated the heterogeneous information in a framework for assessing the predictions rather than for predicting the interactions. Filtering interactions by different confidence values results in networks of different size and reliability. The resulting map is a resource for tuberculosis research, and applications built on it are topics of our future research.
Prediction performances of the SVM predictor in these known protein interactions of M. tuberculosis
We implemented two pipelines for building the protein-interaction map of M. tuberculosis: the SVM-based machine learning method and the interolog mapping method. The two methods are closely related. In the machine learning method, the predictor learns the gene sequence information of interacting protein pairs and maps it from the known interactions to the protein pairs of M. tuberculosis. In the same manner, the interolog method infers an interaction between a pair of proteins that have interacting homologs in another organism; here the protein sequence information of known interactions is mapped by cross-species sequence similarity detection. Our predictions therefore integrate both mapping schemes, exploiting gene sequence information as well as protein sequence information in one framework. Identifying the quantitative relationship between the results of the two methods, and investigating how predictions differ between the two levels of sequence information, are interesting topics for further research. Another research direction is to use other schemes to encode the sequence information in the machine learning method, such as the autocorrelation encoding scheme and the triplet residues method.
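The interolog idea described above can be illustrated with a short sketch. It assumes that an ortholog table from a source species to M. tuberculosis is already available (for example from a sequence similarity search); the function name `map_interologs` and the toy identifiers are hypothetical and do not come from this work.

```python
def map_interologs(source_ppis, orthologs):
    """Transfer interactions from a source species to a target species.

    source_ppis: iterable of (protein_a, protein_b) pairs in the source species.
    orthologs:   dict mapping a source protein to a set of target orthologs.
    """
    predicted = set()
    for a, b in source_ppis:
        # An interaction is transferred only if both partners have orthologs.
        for target_a in orthologs.get(a, ()):
            for target_b in orthologs.get(b, ()):
                if target_a != target_b:  # self-interactions are ignored in this sketch
                    predicted.add(tuple(sorted((target_a, target_b))))
    return predicted


# Toy example with made-up identifiers.
toy_ppis = [("ecoA", "ecoB"), ("ecoB", "ecoC")]
toy_orthologs = {"ecoA": {"mtbA"}, "ecoB": {"mtbB"}}
toy_predicted = map_interologs(toy_ppis, toy_orthologs)
```

In the toy example only the first interaction is transferred, because `ecoC` has no ortholog in the target species. Storing each pair in sorted order keeps the predicted set free of duplicates that differ only in partner order.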
In conclusion, we provided a novel framework that integrates genomic data to infer a protein interaction map of M. tuberculosis. We predicted protein interactions in M. tuberculosis with an SVM-based classifier trained on genetic codons, and we mapped the documented protein interactions of various species to the proteome of M. tuberculosis by the interolog method. Information from gene expression, evolutionary relationships and functional relationships provided reliability measures for evaluating our predictions, and these validations support the effectiveness of our method. Our framework can easily be extended to infer large-scale protein interaction maps in other species. The predicted interactions provide a valuable interactome reference for M. tuberculosis research and a basis for further study of the functional implications underlying the interactome. They are listed in Additional file 2. The details are available at: http://www.aporc.org/doc/wiki/MTBPPI.
We used the SVM method as the classifier. The software libsvm 2.84 was employed, with a radial basis function as the kernel. The positive training pairs are the known, experimentally validated interactions in EcID, 14,058 pairs in total. We built the negative set from protein pairs whose shortest-path length in the EcID network exceeds a cutoff of 6, a choice motivated by the small-world property of complex networks: two proteins are unlikely to interact when their network distance exceeds this threshold. The negative set contains 27,882 protein pairs. Five-fold cross validation was used to test the accuracy of our SVM-based classifier. We then applied the trained predictor to infer the protein interactions in M. tuberculosis.
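The negative-set construction described above can be sketched with a breadth-first search over the known interaction network. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the sketch keeps only pairs that are connected by a finite path longer than the cutoff, since the text does not say how disconnected pairs were treated.

```python
from collections import deque
from itertools import combinations


def shortest_path_lengths(adj, start):
    """Breadth-first search: number of edges from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def negative_pairs(ppis, cutoff=6):
    """Return protein pairs whose shortest-path length exceeds the cutoff."""
    adj = {}
    for a, b in ppis:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    dist = {p: shortest_path_lengths(adj, p) for p in adj}
    negatives = []
    for a, b in combinations(sorted(adj), 2):
        d = dist[a].get(b)  # None when the two proteins are not connected
        if d is not None and d > cutoff:
            negatives.append((a, b))
    return negatives


# Toy network: a chain p0 - p1 - ... - p8 of made-up proteins.
chain = [("p%d" % i, "p%d" % (i + 1)) for i in range(8)]
toy_negatives = negative_pairs(chain)
```

Running one search per protein is adequate for a network with a few thousand proteins. For much larger networks, sampling candidate pairs before computing distances would be cheaper.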
The prediction performance measures are defined in terms of TP, TN, FP and FN, which denote the numbers of true positive, true negative, false positive and false negative predictions, respectively.
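As an illustration of how such counts are turned into performance measures, the sketch below computes four standard quantities. Which measures were reported in this work cannot be recovered from the text, so the selection here (accuracy, sensitivity, specificity, precision) is an assumption, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard performance measures computed from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # also called recall
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Toy counts, not results from the paper.
toy_metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

In a five-fold cross validation, these measures would typically be computed on each held-out fold and then averaged.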
We constructed the protein interaction map of M. tuberculosis from genetic codons and ortholog mapping. We also collected known M. tuberculosis interactions reported in databases and in the experimental literature. By integrating these known protein interactions, we built a comprehensive PPI map of M. tuberculosis. We then collected multiple available resources to assess the constructed map. The confidence of the interactions was evaluated with three additional data sources, namely gene expression, evolutionary relationship and functional similarity.
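As one example of how gene expression can support a predicted interaction, the sketch below computes the Pearson correlation coefficient between two expression profiles. It assumes that coexpression is measured this way (the PCC label in the score combinations suggests it, but the text does not define it), and the function name `pearson` is hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no coexpression signal
    return cov / sqrt(var_x * var_y)
```

A value close to 1 across many conditions indicates that the two genes are coexpressed, which can be used as one line of evidence for the corresponding protein pair.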
The frequency of a GO term c is defined as freq(c) = anno(c) + Σ_{h ∈ children(c)} freq(h), where anno(c) is the number of proteins annotated with this term in our database and children(c) is the set of child nodes of term c. The probability of a term c is then defined as p(c) = freq(c)/freq(root), where freq(root) is the frequency of the root term [37, 38]. We used semantic similarity measures [36–38] to evaluate the similarity of the GO term lists of the interacting proteins. Based on these validations, we can identify the interactions that are consistently supported by several sources of information and obtain a combined protein network by omitting low-reliability pairs.
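The term frequency and probability definitions above can be sketched as follows. The sketch assumes that freq(c) is the term's own annotation count plus the frequencies of its children, and it adds the information content -log p(c) that semantic similarity measures are commonly built on; the function names and the toy ontology are hypothetical.

```python
from math import log


def term_frequencies(children, anno, root):
    """freq(c) = anno(c) + sum of freq(h) over the children h of c."""
    freq = {}

    def visit(term):
        if term not in freq:
            freq[term] = anno.get(term, 0) + sum(
                visit(child) for child in children.get(term, ())
            )
        return freq[term]

    visit(root)
    return freq


def term_probabilities(children, anno, root):
    """p(c) = freq(c) / freq(root) for every term reachable from the root."""
    freq = term_frequencies(children, anno, root)
    if freq[root] == 0:
        raise ValueError("no annotations below the root term")
    return {term: f / freq[root] for term, f in freq.items()}


def information_content(p):
    """Information content of a term with probability p (assumed measure)."""
    return -log(p)


# Toy ontology: root has children a and b, and a has one child a1.
toy_children = {"root": ["a", "b"], "a": ["a1"]}
toy_anno = {"a": 2, "b": 3, "a1": 5}
toy_p = term_probabilities(toy_children, toy_anno, "root")
```

In the toy ontology the root has probability 1 and therefore zero information content, while more specific terms are rarer and more informative. Note that in a true DAG a term with several parents contributes to each of them under this recursive definition.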
SVM: support vector machine
ROC: receiver operating characteristic
AUC: area under curve
BIND: biomolecular interaction network database
BioGRID: biological general repository for interaction datasets
MINT: molecular interaction database
ORF: open reading frame
COG: clusters of orthologous groups
NCBI: national center for biotechnology information
GEO: gene expression omnibus.
This work was supported by the Shanghai Natural Science Foundation under Grant No. 11ZR1443100, by the Chief Scientist Program of Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS) under Grant No. 2009CSP002, and by the Knowledge Innovation Program of SIBS of CAS under Grants No. 2011KIP203 and No. KSCX2-EW-R-01. The first author gratefully acknowledges the support of the SA-SIBS Scholarship Program. Some of the authors were also supported by the National Natural Science Foundation of China (NSFC) under Grants No. 31100949, 61072149 and 91029301, and partially supported by the Aihara Project, the FIRST program from JSPS, initiated by CSTP.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 7, 2012: Advanced intelligent computing theories and their applications in bioinformatics. Proceedings of the 2011 International Conference on Intelligent Computing (ICIC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S7.