Inferring a protein interaction map of Mycobacterium tuberculosis based on sequences and interologs

Background: Mycobacterium tuberculosis is an infectious bacterium that poses a serious threat to human health. Because protein interactions in this organism are difficult to detect with molecular biology experiments, reconstructing its protein interaction map by computational methods will provide crucial information for understanding the biological processes of the pathogen, as well as a framework upon which new therapeutic approaches can be developed.

Results: We constructed an integrated M. tuberculosis protein interaction network by machine learning and ortholog-based methods. First, we built a support vector machine (SVM) predictor to infer the protein interactions of M. tuberculosis H37Rv from gene sequence information. We tested the predictor in Escherichia coli and mapped the genetic codon features underlying its protein interactions to M. tuberculosis. In addition, the documented interactions of 14 other species were mapped to the interactome of M. tuberculosis by the interolog method. The combined protein interactions were validated by various functional relationships, i.e., gene coexpression, evolutionary relationship and functional similarity, extracted from heterogeneous data sources. The accuracy and validation demonstrate the effectiveness and efficiency of our framework.

Conclusions: A protein interaction map of M. tuberculosis is inferred from genetic codons and interologs. The prediction accuracy and the computational validation demonstrate the effectiveness and efficiency of our method. Furthermore, our methods can be straightforwardly extended to infer the protein interactions of other bacterial species.


Background
M. tuberculosis, which causes tuberculosis affecting the lungs and other organs, is the second largest cause of death from infectious diseases [1]. An extensive protein-protein interaction (PPI) network of M. tuberculosis can enable more comprehensive screens of cellular operations. In this context, approaches to infer its interactome will contribute to identifying infectious mechanisms, detecting important drug target proteins and promoting potential therapy innovations. To date, genome-wide experimental and computational systems for studying PPIs in M. tuberculosis are unavailable [2]. It is therefore necessary to develop approaches capable of converting available genomic data into the functional information of a protein-interaction map for M. tuberculosis. E. coli is one of the best model systems for studying bacterial physiology [3], with a relatively well-characterized interactome, genome and transcriptome [4]. It is believed that protein interactions are conserved across different organisms [5]. Interaction features can be learned by machine learning methods such as support vector machines (SVMs) [6,7], and it is also common to predict protein interactions from the known interactions of other organisms by the interolog method [8].
Compared with other methods, sequence-based prediction methods have the advantage of simple data requirements: they can be applied to any species with a completely sequenced genome. Several sequence-based studies have successfully predicted PPIs in model organisms such as H. sapiens, S. cerevisiae and E. coli [6,9,10,11,12]. However, a limitation of these methods is that they require a large amount of training data to reach satisfactory accuracy. For model organisms, a large volume of known PPIs can be used as training data, but few experimental PPI data exist for some dangerous bacteria such as M. tuberculosis. Thus, a novel integration method needs to be developed. In this work, we provide cross-species PPI predictions in M. tuberculosis by integrating different types of protein interaction information from other species. Genetic information in the form of codons, i.e., tri-nucleotide sequences, is translated into proteins [13]. It is well known that codon usage is correlated with expression level [9,13]. The codon carries the genetic information that specifies the amino acid sequence of the polypeptide during protein synthesis. The coding sequence is not only the blueprint for translation into amino acids, but also the original information underlying the transcription of gene expression. Here, genetic codons are selected as the sequence features for learning interaction patterns. Moreover, the orthologs of interacting proteins in other organisms provide additional information about potential interactions through comparative genomics.
In this work, we developed a systematic method that combines heterogeneous data sources to infer a comprehensive protein interaction map for M. tuberculosis. The codon features of interacting protein pairs are extracted and used to train an SVM classifier, and the interactome of M. tuberculosis is then predicted by this codon-based method. Moreover, the interactions from 14 other species are mapped to M. tuberculosis by the interolog method. Available data from multiple levels, including gene coexpression, evolutionary relationship and functional similarity, are used to assign confidence to the predicted interactions. The evidence from these sources supports the effectiveness of our method. The network properties of the constructed protein-interaction map are also identified. The predicted protein interaction network and the proposed method provide a framework for studying the functional specificities of M. tuberculosis.

Results
Predictor performance
E. coli is one of the best characterized organisms [3,4], and we chose it as a model system for building the protein interaction map of M. tuberculosis. Positive and negative sets of protein interactions in E. coli were designed to test the performance of our codon-based prediction method. The genome and proteome of E. coli were downloaded and prepared for the interacting sets as well as all known open reading frames (ORFs) [14]. The distance between two ORFs in terms of their usage of codon c is defined from $f_i(c)$ and $f_j(c)$, the relative frequencies of codon c in ORF i and ORF j. By definition, $\sum_k f_i(c_k) = 1$ and $\sum_k f_j(c_k) = 1$, where k = 1, 2, ..., 64 runs over all codons. There are 14,058 interacting pairs and 27,882 non-interacting pairs among 4,227 proteins of E. coli. A five-fold cross validation is performed on these pairs, i.e., we train the SVM classifier on the codon features of 80% of the pairs and test the predictions on the remaining 20%. Figure 1 shows the receiver operating characteristic (ROC) curves of the SVM predictor using genetic codon features. Because of the degeneracy of the genetic code, several codons correspond to the same amino acid. The prediction performance obtained by merging the frequencies of these degenerate codons ('codon-mer') is also shown in Figure 1. The details of prediction precision and accuracy are listed in Table 1. The SVM predictor achieves a prediction accuracy (ACC) of 0.9003 and an area under the curve (AUC) of 0.9507 for the PPIs of E. coli. These results provide evidence for the effectiveness and efficiency of predicting protein interactions from genetic codons by a machine learning method.
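As an illustration of the codon features described above, the following Python sketch computes the relative codon frequencies of an ORF and a 64-dimensional feature vector for a pair of ORFs. It assumes that the per-codon distance is the absolute difference $|f_i(c) - f_j(c)|$; this form is an assumption made for illustration, and the function names `codon_frequencies` and `pair_features` are hypothetical.

```python
from itertools import product

BASES = "ACGT"
CODONS = ["".join(p) for p in product(BASES, repeat=3)]  # all 64 codons


def codon_frequencies(orf: str) -> dict:
    """Relative frequency f(c) of each of the 64 codons in an in-frame ORF."""
    orf = orf.upper().replace("U", "T")
    counts = dict.fromkeys(CODONS, 0)
    # Walk the sequence codon by codon, ignoring a trailing partial codon.
    for i in range(0, len(orf) - len(orf) % 3, 3):
        codon = orf[i:i + 3]
        if codon in counts:  # skip codons containing ambiguous bases such as N
            counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("ORF contains no complete unambiguous codon")
    return {c: n / total for c, n in counts.items()}


def pair_features(orf_i: str, orf_j: str) -> list:
    """Feature vector of an ORF pair.

    ASSUMPTION: the distance for codon c is |f_i(c) - f_j(c)|.
    """
    f_i = codon_frequencies(orf_i)
    f_j = codon_frequencies(orf_j)
    return [abs(f_i[c] - f_j[c]) for c in CODONS]
```

Each vector produced by `pair_features` could then serve as one training example for a classifier, labeled as interacting or non-interacting.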

Protein interactions in M. tuberculosis
To explore protein interactions in M. tuberculosis, we used the previously trained SVM classifier to infer interactions from the codon content of ORFs at the gene sequence level. Based on the genetic codons of the laboratory strain H37Rv of M. tuberculosis, we predicted 12,899 interactions among 3,266 proteins. Furthermore, the known protein interactions of other species were mapped to the proteome of M. tuberculosis by the interolog method. We collected the documented interactions of 14 species from the PPI databases IntAct and DIP, and the interacting proteins were transferred to the M. tuberculosis proteome by ortholog detection. Table 2 lists the detailed prediction results of the interolog method. We also found 530 known protein interactions of M. tuberculosis in various databases, such as BIND [15] and Reactome [16]. Combining all predictions with these known interactions, we built a comprehensive protein interaction map with a total of 46,119 interactions among 3,465 proteins in M. tuberculosis. The inferred protein interaction map of M. tuberculosis is shown in Figure 2.

Validation results
Interacting protein pairs have been shown to be closely related in terms of gene coexpression [17], coevolution [18], similar GO annotations [19], phenotype association and similar physicochemical properties [20]. For M. tuberculosis, we collected the available heterogeneous data sources of these kinds to annotate every predicted interacting pair.
Firstly, we annotated the predicted interacting protein pairs with the Pearson's correlation coefficient (PCC) of their gene coexpression, so that every prediction carries a coexpression value derived from gene expression profiling. For comparison, we calculated the corresponding correlation values for the same number of randomly selected protein pairs. Figure 3(a) shows the boxplot of coexpression values in the predictions. The predicted interacting pairs tend to be more strongly coexpressed than the randomly selected pairs (P-value = $4.69 \times 10^{-3}$, Mann-Whitney U test). Secondly, we identified the evolutionary relationship of the interacting proteins from clusters of orthologous groups (COG) information. Each interacting protein was assigned to its own COG. Figure 3(b) shows the boxplot of evolutionary relationship values in the predicted interacting pairs and in the same number of randomly selected protein pairs. The difference, measured by the Mann-Whitney U test, is not significant (P-value = 0.53); nevertheless, every predicted interaction receives a confidence value for its evolutionary relationship. Thirdly, we calculated the functional similarities underlying the predicted interactions. We measured the semantic similarity between the gene ontology (GO) term pairs of interacting proteins, taking into account the hierarchical structure of the GO directed acyclic graph and the specificity of GO terms. We identified the functional relationships between predicted interacting proteins in each of the three ontology categories, i.e., cellular component ('CC'), molecular function ('MF') and biological process ('BP'). The boxplots of the three GO similarity values in random protein pairs and in predicted pairs are shown in Figure 3(c), 3(d) and 3(e), respectively. The predicted interactions have higher functional similarity than random pairs (P-values of $6.63 \times 10^{-4}$, $< 2.2 \times 10^{-16}$ and $< 2.2 \times 10^{-16}$, respectively), which provides further evidence for the effectiveness of our methods. After annotation with these sources of information, every predicted interaction is assigned an evaluation confidence.
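The comparison between predicted and randomly selected pairs can be sketched in Python as follows. This is a minimal standard-library illustration of a two-sided Mann-Whitney U test using a tie-corrected normal approximation without continuity correction; it is not necessarily the implementation behind the reported P-values, and the function name `mann_whitney_u` is hypothetical.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U statistic of sample_a, two-sided p-value). Tied values get
    average ranks and the variance is corrected for ties.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    n = n_a + n_b
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        t = j - i                     # size of this group of tied values
        tie_term += t ** 3 - t
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    mean_u = n_a * n_b / 2.0
    var_u = n_a * n_b / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_a, 1.0  # all values identical: no evidence of a difference
    z = (u_a - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u_a, p_value
```

Here `sample_a` would hold the coexpression (or similarity) values of the predicted pairs and `sample_b` those of the random pairs. The normal approximation is reasonable for samples of the size considered here, but an exact test is preferable for very small samples.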

Network analysis
To obtain a global view of the protein-interaction map, we identified the topological features of the integrated network and the features of particular interactions. Firstly, we examined the features of the initial network, constructed by machine learning and interologs combined with the known interacting protein pairs. The degree distribution, clustering coefficient, characteristic path length and network diameter are identified individually. Table 3 lists some network properties. The network diameter is the longest of the shortest paths between any two proteins. The characteristic path length is calculated by averaging the minimum distance between protein pairs. The clustering coefficient measures the degree to which nodes in a network tend to cluster together [21]. A network whose degree distribution follows a power law is often called a scale-free network [22]. These measures describe the properties of the inferred protein-interaction map of M. tuberculosis.
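The network measures above can be illustrated with a small standard-library Python sketch. It assumes an undirected, unweighted network given as a list of protein pairs, averages path lengths over connected pairs only, and assigns a clustering coefficient of zero to nodes with fewer than two neighbors; these conventions and all function names are assumptions made for illustration.

```python
from collections import deque


def build_graph(edges):
    """Undirected adjacency sets from a list of (protein, protein) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-interactions in this sketch
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Shortest-path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def path_statistics(adj):
    """(characteristic path length, diameter) over all connected pairs."""
    total, pairs, diameter = 0, 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return (total / pairs if pairs else 0.0), diameter


def average_clustering(adj):
    """Mean local clustering coefficient (0 for nodes of degree < 2)."""
    coeffs = []
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        # Count edges among the neighbors of this node, each pair once.
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        coeffs.append(2.0 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)
```

Running a breadth-first search from every node is quadratic in network size, which is acceptable for a network of a few thousand proteins.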
Hub proteins, as well as other proteins of interest, can be selected for the analysis of particular dysfunctions of M. tuberculosis. Using the validations by gene coexpression, evolutionary relationship in COGs and functional similarity, we can evaluate the reliability of the interactions and identify the pairs that are consistently supported by the different levels of information. We then calculated the features of the filtered network obtained by omitting the pairs with lower confidence values, while keeping the predictions for which no evaluation is available. We also identified the distribution of node degree and found that the constructed protein interaction network satisfies the topological features of complex networks [22]. The analysis is based on fitting the degree distribution of a scale-free network, and the exponent γ of the fitted power-law distribution lies in the range 1 < γ < 2. There are 477 hub proteins in the protein interaction map when the degree threshold is 50. The hub proteins for different thresholds can be found in Additional file 1.
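Hub selection and power-law fitting can be sketched in Python as below. Two assumptions are made for illustration: a hub is a protein whose degree is at least the threshold, and the exponent is estimated with the approximate discrete maximum-likelihood formula of Clauset, Shalizi and Newman, which is a substituted technique and not necessarily the fitting procedure used for the reported range. The function names are hypothetical.

```python
import math


def hub_proteins(degrees, threshold=50):
    """Proteins whose degree is at least `threshold`.

    ASSUMPTION: the threshold is inclusive (degree >= threshold).
    """
    return sorted(p for p, k in degrees.items() if k >= threshold)


def power_law_exponent(degree_values, k_min=1):
    """Approximate maximum-likelihood estimate of the power-law exponent.

    Uses gamma ~ 1 + n / sum(ln(k_i / (k_min - 0.5))) over degrees >= k_min.
    The approximation is known to be rough for very small k_min.
    """
    if k_min < 1:
        raise ValueError("k_min must be at least 1")
    tail = [k for k in degree_values if k >= k_min]
    if not tail:
        raise ValueError("no degree is >= k_min")
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum
```

For example, `hub_proteins(degrees, threshold=50)` applied to a dictionary mapping each protein to its degree returns the hub list for the threshold used above, and other thresholds can be explored by changing the argument.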

Discussions
In this work, we proposed a method to build the protein-interaction map of M. tuberculosis by machine learning and interologs. We obtained the genetic codon features underlying interacting proteins from the relatively well-established interactome of E. coli. These features were transferred to the proteome of M. tuberculosis by training an SVM classifier. The cross validation showed the effectiveness and efficiency of our predictor. We also implemented the interolog method to map the documented protein interactions of other organisms to M. tuberculosis. These heterogeneous data were combined in a novel framework to infer the interactions in M. tuberculosis. The constructed protein interaction network provides more information about this infectious bacterium, which threatens human health. We used multiple sources of available functional genomic data about M. tuberculosis to evaluate the predicted interactions: gene coexpression, evolutionary relationship and functional similarity were used to check the reliability of the targeted pairs, and the predicted pairs can be filtered with this information for potential applications. This information could also be used directly to build the functional relationship of protein pairs [23][24][25]. Due to the limited knowledge about M. tuberculosis, however, we integrated the heterogeneous information in an alternative framework for assessing the predictions rather than for predicting the interactions. Filtering interactions by different confidence values results in networks of different size and reliability. These networks provide valuable resources for tuberculosis research, and applications based on our constructed protein interaction map are topics of our future research.
In our framework, we proposed a cross-species prediction by mapping the documented interactions of other species to M. tuberculosis. For completeness, we also collected known curated interactions of M. tuberculosis. To check our predictions, we tested the codon-based predictor on these known interactions. Figure 4 shows the ROC curves of the prediction performance. Our method achieves a high AUC of 0.945 with the codon-based method and of 0.951 when the frequencies of degenerate codons are merged. Table 4 shows the prediction performance. Merging the codons gives an accuracy similar to that of the codon-based method. In our previous work, we concluded that there is only a subtle difference between the two encoding schemes for predicting protein interactions [7]. Both methods are reasonable, and their differences depend on the underlying data sets. The results provide more evidence for the effectiveness and efficiency of our proposed methods.
We implemented two pipelines for building the protein-interaction map of M. tuberculosis, i.e., the SVM-based machine learning method and the interolog mapping method. The two methods are closely related. In the first, the gene sequence information of interacting protein pairs is learned by the predictor, and the information from these known interactions is mapped to the protein pairs of M. tuberculosis. In the same manner, the interolog method identifies an interaction between a pair of proteins that have interacting homologs in another organism, so the protein sequence information of known interactions is mapped by cross-species sequence similarity detection. Identifying the quantitative relationship between the prediction results of the two methods is an interesting research topic. By integrating these mapping schemes, we exploited both gene sequence information and protein sequence information in one framework to infer the protein-interaction map of M. tuberculosis. Another research direction is to implement other schemes to encode the sequence information in the machine learning method, such as the autocorrelation encoding scheme [26] and the triplet residues method [6]. It is also of interest to investigate how the predictions differ between the two levels of sequence information.

Conclusion
In conclusion, we provided a novel framework to integrate genomic data to infer a protein interaction map of M. tuberculosis. We predicted the protein interactions in

Methods
Framework of prediction
Figure 5 shows our framework to infer the protein interaction map of M. tuberculosis. The protein interactions were predicted by two main pipelines. Firstly, we built the protein interaction network of M. tuberculosis from the codon features of interacting proteins in E. coli by a machine learning approach. The integrated interaction map and gene sequences of E. coli were downloaded from EcID, which collects comprehensive PPIs in E. coli by combining various sources of knowledge [4]. We used the protein interactions of E. coli to train an SVM classifier that captures the genetic codon features underlying these interacting pairs. The interactions in M. tuberculosis were then predicted by the trained SVM predictor from the genetic codons of the ORFs in the gene sequences of M. tuberculosis. We chose the laboratory strain H37Rv as our model organism [14]. These processes are shown in the upper-left square frame of Figure 5. Secondly, we inferred the protein interactions of M. tuberculosis by the interolog method from the documented protein interactions of 14 other species. We collected these interactions from IntAct [27] and DIP [28]. For each species, the M. tuberculosis homologs of the interacting proteins were detected individually by BLAST [29]. The homologs of two interacting proteins are identified as predicted interactors. This pipeline is shown in the upper-right square frame of Figure 5. To validate the predicted results, we tested our method in E. coli and on the known interactions of M. tuberculosis. Three pieces of available information about M. tuberculosis, i.e., gene expression profiling, evolutionary relationship from an ortholog database and functional similarity, were used to evaluate the confidence of the prediction results. The known protein interactions were also included in our constructed interactome of M. tuberculosis. Finally, we inferred an integrated protein interaction map of M. tuberculosis.
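The interolog pipeline can be illustrated with the following minimal Python sketch. It assumes that homolog detection has already been run and summarized as a mapping from each source-species protein to the set of its M. tuberculosis homologs; the function name `map_interologs`, the choice to discard self-pairs, and all protein identifiers are hypothetical placeholders.

```python
def map_interologs(source_interactions, homologs):
    """Transfer interactions from a source species to a target species.

    source_interactions: iterable of (protein_a, protein_b) pairs in the
        source species.
    homologs: dict mapping a source protein to the set of its homologs in
        the target species (for example, derived from sequence similarity
        hits).
    Returns a set of unordered target pairs, each stored as a sorted tuple.
    """
    predicted = set()
    for a, b in source_interactions:
        for target_a in homologs.get(a, ()):
            for target_b in homologs.get(b, ()):
                if target_a != target_b:  # ASSUMPTION: drop self-pairs
                    predicted.add(tuple(sorted((target_a, target_b))))
    return predicted
```

Applying the function once per source species and taking the union of the returned sets would give the combined interolog predictions.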

SVM-based predictor
We used the SVM method [30] as the classifier. The software libsvm 2.84 [31] was employed, and a radial basis function was chosen as the kernel function in our implementation. The positive training pairs are the known, experimentally validated interactions in EcID. There were 14,058 positive pairs. We built the negative set from protein pairs whose shortest-path length in the EcID network is larger than a cutoff of 6, motivated by the small-world property of complex networks [21]. Two proteins are unlikely to interact with each other when their distance exceeds this threshold. The negative set contained 27,882 protein pairs. A five-fold cross validation was implemented to test the accuracy of our SVM-based classifier. We then applied the trained predictor to infer the protein interactions in M. tuberculosis.
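The construction of the negative set can be sketched in Python as follows. The sketch assumes an undirected, unweighted interaction network and keeps only pairs that are connected by a finite shortest path longer than the cutoff; how disconnected pairs should be treated is not specified here, so excluding them is an assumption, as is the function name `distant_pairs`.

```python
from collections import deque


def distant_pairs(edges, cutoff=6):
    """Protein pairs whose shortest-path length exceeds `cutoff`.

    ASSUMPTION: pairs in different connected components (no path at all)
    are not returned.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    negatives = set()
    for source in adj:
        # Breadth-first search gives shortest-path lengths from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        for target, d in dist.items():
            if d > cutoff:
                negatives.add(tuple(sorted((source, target))))
    return negatives
```

The returned pairs could then be sampled to obtain a negative set of the desired size.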
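The prediction performance was evaluated by several measures: sensitivity (SN), specificity (SP), accuracy (ACC) and precision (PRE). The evaluation is usually displayed as a ROC curve, summarized by the area under the curve (AUC). Mathematically, these measures are defined as

$$\mathrm{SN} = \frac{TP}{TP + FN}, \quad \mathrm{SP} = \frac{TN}{TN + FP}, \quad \mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN}, \quad \mathrm{PRE} = \frac{TP}{TP + FP},$$

where TP, FP, TN and FN are the numbers of true positives, false positives, true negatives and false negatives, respectively.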
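These measures can be computed from the confusion matrix as in the following minimal Python sketch. It assumes the standard definitions in terms of true and false positives and negatives, labels coded as 1 (interacting) and 0 (non-interacting), and a rank-based AUC in which tied scores count as one half; the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels coded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def _ratio(num, den):
    """Safe division: undefined ratios are reported as NaN."""
    return num / den if den else float("nan")


def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and precision."""
    return {
        "SN": _ratio(tp, tp + fn),
        "SP": _ratio(tn, tn + fp),
        "ACC": _ratio(tp + tn, tp + fp + tn + fn),
        "PRE": _ratio(tp, tp + fp),
    }


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC computation is quadratic in the number of examples; for large test sets a sort-based rank formulation gives the same value more efficiently.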

Validation from multiple resources
We constructed the protein interaction map of M. tuberculosis from genetic codons and ortholog mapping. We also collected the known interactions of M. tuberculosis that are deposited in databases from experimental results reported in the literature. Integrating these known protein interactions, we built a comprehensive PPI map of M. tuberculosis. We then collected multiple available resources to assess the constructed protein interaction map. The confidence of the interactions was evaluated by three additional data sources, namely gene expression, evolutionary relationship and functional similarity.
Firstly, we identified the PCCs of gene coexpression of pairwise proteins in the predicted network. We downloaded the gene expression profiling data of M. tuberculosis H37Rv from NCBI GEO (ID: GSE9776) [32]. The correlation between genes i and j is calculated by

$$\mathrm{PCC}(x_i, x_j) = \frac{\operatorname{cov}(x_i, x_j)}{s_i s_j} = \frac{E\left[(x_i - \mu_{x_i})(x_j - \mu_{x_j})\right]}{s_i s_j},$$

where $\mu_{x_i}$ and $\mu_{x_j}$ are the means of the gene expression profiles $x_i$ and $x_j$, and $s_i$ and $s_j$ are their standard deviations.

Secondly, we evaluated the evolutionary relationship between the predicted interacting proteins. COGs were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages [33]. Figure 6 presents the method to identify the evolutionary information between the predicted interacting proteins. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain [33]. The maximum COG value between the two groups in which the interacting proteins are located was regarded as the value representing their evolutionary relationship.

Figure 6. Identification of evolutionary relationship between two interacting proteins. The maximum COG value between the two groups in which the interacting proteins are located is used as the value of their evolutionary relationship.

Thirdly, GO [34] similarity between the predicted pairs was identified to evaluate their functional relationship. We downloaded the annotations for M. tuberculosis H37Rv from GOA [35]. In the GO directed acyclic graph, terms far from the root are more informative than those close to the root. We calculated the GO probability for specific GO terms [36]. The frequency of a GO term in a database is defined as

$$\mathrm{freq}(c) = \mathrm{anno}(c) + \sum_{h \in \mathrm{children}(c)} \mathrm{freq}(h),$$
where anno(c) is the number of proteins annotated with term c in our database and children(c) is the set of child nodes of term c. The probability of a term c is then defined as p(c) = freq(c)/freq(root), where freq(root) is the frequency of the root term [37,38]. We used semantic similarity measures [36][37][38] to evaluate the similarity of the GO term lists corresponding to the interacting proteins. Based on these validations, we can identify the interactions that are consistently supported by the various sources of information and derive an ensemble protein network by omitting pairs of low reliability.
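The term frequency and probability defined above can be sketched in Python as follows. The recursion follows the definition of freq(c) literally. The similarity function uses a Resnik-style measure (the information content of the most informative common ancestor), which is one common semantic similarity measure and is shown here only as an assumed example; the function names and the toy ontology used to exercise them are hypothetical.

```python
import math


def term_probabilities(children, anno, root):
    """p(c) = freq(c) / freq(root) for every term reachable from the root.

    children: dict mapping a term to an iterable of its child terms.
    anno: dict mapping a term to its number of annotated proteins.
    """
    freq_cache = {}

    def freq(c):
        # freq(c) = anno(c) + sum of freq(h) over the children h of c
        if c not in freq_cache:
            freq_cache[c] = anno.get(c, 0) + sum(
                freq(h) for h in children.get(c, ()))
        return freq_cache[c]

    total = freq(root)
    if total == 0:
        raise ValueError("no annotations below the root term")
    return {c: f / total for c, f in freq_cache.items()}


def ancestors(children, term):
    """The term itself plus all of its ancestors in the ontology graph."""
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, set()).add(parent)
    seen, stack = {term}, [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def resnik_similarity(children, prob, term_1, term_2):
    """ASSUMED measure: max of -log p(c) over common ancestors c."""
    common = ancestors(children, term_1) & ancestors(children, term_2)
    return max((-math.log(prob[c]) for c in common if prob.get(c, 0) > 0),
               default=0.0)
```

Because terms far from the root have smaller p(c), two terms that share a specific common ancestor receive a higher similarity than two terms that share only the root.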