Computational reconstruction of proteome-wide protein interaction networks between HTLV retroviruses and Homo sapiens

Background Human T-cell leukemia viruses (HTLV) tend to induce some fatal human diseases like Adult T-cell Leukemia (ATL) by targeting human T lymphocytes. To indentify the protein-protein interactions (PPI) between HTLV viruses and Homo sapiens is one of the significant approaches to reveal the underlying mechanism of HTLV infection and host defence. At present, as biological experiments are labor-intensive and expensive, the identified part of the HTLV-human PPI networks is rather small. Although recent years have witnessed much progress in computational modeling for reconstructing pathogen-host PPI networks, data scarcity and data unavailability are two major challenges to be effectively addressed. To our knowledge, no computational method for proteome-wide HTLV-human PPI networks reconstruction has been reported. Results In this work we develop Multi-instance Adaboost method to conduct homolog knowledge transfer for computationally reconstructing proteome-wide HTLV-human PPI networks. In this method, the homolog knowledge in the form of gene ontology (GO) is treated as auxiliary homolog instance to address the problems of data scarcity and data unavailability, while the potential negative knowledge transfer is automatically attenuated by AdaBoost instance reweighting. The cross validation experiments show that the homolog knowledge transfer in the form of independent homolog instances can effectively enrich the feature information and substitute for the missing GO information. Moreover, the independent tests show that the method can validate 70.3% of the recently curated interactions, significantly exceeding the 2.1% recognition rate by the HT-Y2H experiment. We have used the method to reconstruct the proteome-wide HTLV-human PPI networks and further conducted gene ontology based clustering of the predicted networks for further biomedical research. The gene ontology based clustering analysis of the predictions provides much biological insight into the pathogenesis of HTLV retroviruses. Conclusions The Multi-instance AdaBoost method can effectively address the problems of data scarcity and data unavailability for the proteome-wide HTLV-human PPI interaction networks reconstruction. The gene ontology based clustering analysis of the predictions reveals some important signaling pathways and biological modules that HTLV retroviruses are likely to target. Electronic supplementary material The online version of this article (doi:10.1186/1471-2105-15-245) contains supplementary material, which is available to authorized users.

are still very small. For instances, the latest HIV-human PPI database [6] contains 3,638 interactions, the P.falciparum-H.sapiens PPI dataset [7] contains 1,112 interactions, and the smallest Salmonella-human PPI dataset [8] contains just 62 interactions. Schleker et al. [9] used HT-Y2H (high-throughput yeast-two-hybrid) to detect 166 interactions between HTLV (Human T-cell lymphotropic viruses) and human proteins. Such small pathogen-host PPI datasets are prone to yield model overfitting.
Most of the reported computational methods for pathogen-host PPI prediction focus on the pathogens like HIV-1 [10][11][12][13][14], P.falciparum [15], Salmonella [16][17][18], etc., and generally leverage multiple biological feature information as shown in Table 1. Integration of feature information truly improves the model performance to a certain degree, but it has the two major demerits: (1) aggregation of multiple feature information without augmenting the training data is prone to cause model overfitting on small training data; (2) integration of feature information poses demanding data constraints on the computational modeling. When the feature information is not available to test data, the trained model will fail to work. Thus, how to effectively substitute for the potentially missing feature information is a major issue of computational modeling. In [17,19], the missing feature information such as gene ontology (GO) and gene expression was elaborately substituted with the homolog GO knowledge and protein sequences. The feature information of protein sequences, though cheap to obtain, is criticized for its poor predictive power [20].
As a member of the family of retroviruses, Human Tcell lymphotropic viruses (HTLV) are divided into two sub-types. The type 1 virus (HTLV-1) is known to induce Adult T-cell Leukemia/Lymphoma (ATL), but what diseases are caused by the type 2 virus (HTLV-2) remain unclear [9]. The HT-Y2H (high-throughput yeast-twohybrid) [21,22] was used to yield 166 interactions between HTLV and human proteins. However, this HT-Y2H study validated only three interactions between HTLV-1 Tax and three human proteins (Nup62, MAD1L1, Cdc23) that have been collected in the databases VirusMINT [23] and VirHostNet [24]. Since there are 145 HTLV-human PPIs in the two databases, this HT-Y2H study achieves only 2.1% recognition rate of experimentally derived PPIs. Such a low recognition rate is partly caused by different sensitivity of experimental methods to different types of interaction. Computational modeling can shield the lowlevel biochemical specificity (e.g. covalent modification) of protein-protein interactions to set up a general-purpose PPI predictor. To our knowledge, no computational method has been developed for fast reconstruction of proteome-wide HTLV-human PPI networks.
In this work, we propose a computational method that addresses the problems of data unavailability and data scarcity for reconstructing proteome-wide HTLV-human PPI networks. The homolog knowledge, in terms of gene ontology (GO), is treated as auxiliary homolog instances to mingle with the target instances (the GO knowledge of the proteins themselves), such that (1) the homolog instances augment the training data to reduce the risk of model overfitting; (2) the feature information is enriched to make up for data scarcity; (3) the homolog instances are used as substitute when the target instances are not available. It is noted that such a way of homolog knowledge transfer may introduce a certain level of noise that results from evolutionary divergence. On the basis of the original instance reweighting AdaBoost [25,26], we propose Multi-instance Adaboost to attenuate the noise from homolog instances. The model performance is evaluated by 10-fold cross validation and independent test. Last, we use Multi-instance AdaBoost to reconstruct the proteomewide HTLV-human PPI networks and further conduct gene ontology based clustering analysis of the predictions to gain insight into the pathogenesis of HTLV retroviruses.

Data and materials
The training data are collected from two sources, one dataset is from [9] that contains 166 interactions (hereinafter called S1 pos ), and the other dataset is from the two databases [23,24] that contains 145 interactions (hereinafter called S2 pos ). After removing those putative/uncharacterized/uncurated/hypothetic HTLV proteins and those HTLV proteins that have no corresponding accessions in the Uniprot database (http://www.uniprot.org/uniprot/), S1 pos is reduced to 155 interactions between 9 HTLV proteins and 112 human proteins. Accordingly, the negative data of equal size are randomly sampled for S1 pos and, called S1 neg , S2 neg , respectively. Then the two training data are defined as S1 = S1 pos ∪ S1 neg and S2 = S2 pos ∪ S2 neg , and the whole training data is defined as S = S1 ∪ S2. It is noted that each training data are actually doubled in size, because each data point is represented with two instances, i.e., the target instance and the homolog instance. Binding motif, gene expression profile, gene ontology, sequence similarity, post-translational modification, tissue distribution, PPI network topology [10,11] Protein domain profile, sequence k-mer [12] Structural similarity [13] Protein domain profile, gene expression, gene ontology, gene co-expression [15] GO feature construction The homologs are extracted from SwissProt database [27] using PSI-BLAST [28] (E-value = 10) and the gene ontology (GO) terms are extracted from GOA database [29]. To increase the coverage of homologs, we adopt default E-value (E-value = 10) of PSI-BLAST and search for the space of all the species available in SwissProt database. For each protein i, we obtain two sets of GO terms, one set contains the GO terms from the homologs denoted as homolog set S i H , and the other set contains the GO terms from the protein itself denoted as target set S i T . Based on the denotations, we can formally define two feature vectors for a protein pair (i 1 , i 2 ) as follows: . In practical implementation, each GO term g is assigned an integer index. Formula (1) means that if the protein pair (i 1 , i 2 ) shares the same GO term g, then the corresponding component in the feature vector B is set 2; if neither protein in the protein pair possesses the GO term g, then the value is set 0; otherwise the value is set 1. The above definition is symmetrical, i.e., the protein pair (i 1 , i 2 ) and the protein pair (i 2 , i 1 ) have identical feature representation.

Multi-instance AdaBoost
In the scenario of traditional machine learning, data point is generally represented with only one instance, whereas only one instance is not enough to depict a biological molecule (e.g. protein, DNA, RNA) in computational studies. For instance, a series of multi-aspect information is needed to depict the temporal and spatial information of DNA transcription, protein folding, etc. Moreover, evolutionary information may be needed to provide abundant knowledge about the biological molecule concerned. To address the problem, we are motivated to explore multiinstance learning to enrich protein information by representing proteins with more than one instance.
Here we depict each protein with two instances, one instance called target instance is used to represent the gene ontology (GO) information of the protein itself, and the other instance called homolog instance is used to represent the GO information of the homologs. The homolog instance is used to capture the evolutionary information as well as to enrich the feature information of the target instance. Meanwhile, the homolog instance also plays an important role in tackling the problem of data unavailability. When the feature information indispensible for prediction is not available, the homolog instance can be treated as substitute for the target instance to guarantee that the model still works. However, in some cases the homolog instances are likely to carry noise because of evolutionary divergence, thus it is not proper to treat the two kinds of instances equally. One way to solve the problem is to assign different weight distributions to the two kinds of instances, so that the predictive model can be boosted to generalize well. To our knowledge, Ada-Boost [25,26] is a boosted ensemble classifier that iteratively reweight the instances according to the difficulty of classification. AdaBoost instance reweighting [25,26] is defined as follows: where x i denotes the i-th training instance, y i denotes its class label, f m (x i ) denotes the decision value predicted by the committed obtained in the m-th round of training, D m (i) denotes the weight of the i-th training instance in the m-th round of training, and Z m denotes the normalizer. From Formula (1), we can see that Ada-Boost assigns high weights to those hard-to-classify instances and assigns low weights to those easy-to-classify instances for the next round of training. This idea of iterative reweighting of the training samples is essential to Boosting. Intuitively speaking, the examples that are misclassified get higher weights in the next iteration, for instance, the noisy/outlier examples near the decision boundary are usually harder to classify and therefore get high weights after a few iterations [30]. In [30], it has been theoretically proven that the boosted ensemble classifier achieves a large margin between two-class hyperplanes through multiple rounds of instances reweighting. From a theoretical point of view, AdaBoost implicitly penalizes the ℓ 1 norm [27], and the regularization technique penalizes the impact of noise/outlier at the cost of higher training error to achieve lower generalization error.
The latest AdaBoost (Modest AdaBoost, [26]) combines the distribution of instance weights and its inverted distribution into a decision function to make the decision "soft" (see Additional file 1). Based on Modest AdaBoost, we develop the Multi-instance AdaBoost method to conduct homolog knowledge transfer. As compared to singleinstance AdaBoost, Multi-instance AdaBoost shows no much difference in the training phase, except that each protein pair is represented by two instances B where | • | denotes absolute value. Then the final label for (i 1 , i 2 ) is defined as below:

Model evaluation
We design three experimental settings to validate the effectiveness of the proposed Multi-instance Adaboost.
The first setting is Single-instance AdaBoost, used as the baseline model to evaluate the performance gain from homolog instances. In this setting, each protein pair is represented by the target instance B The second setting is Multi-instance AdaBoost Novel, deliberately designed to evaluate the model robustness to data unavailability. In this setting, the training data are represented by two kinds of instances, while the test data are represented with homolog instances alone to simulate data unavailability. The third setting is Multi-instance AdaBoost, designed to evaluate the model capability of overcoming data scarcity. In this setting, both the training data and the test data are represented by the two kinds of instances. We estimate the model performance for the three settings using 10-fold cross validation and independent test. Receiver Operating Characteristic (ROC) AUC (Area Under Curve) (referred to as ROC-AUC), Precision recall curve AUC (PR-AUC), Specificity (SP), Sensitivity (SE), MCC (Matthews correlation coefficient), F1 score and Overall Accuracy are adopted as performance metrics. The formal definitions of the performance metrics are given in the Additional file 1.

Model performance evaluation
Before proteome-wide predictions, we first evaluate the reliability of Multi-instance AdaBoost. In [9], the experimental HT-Y2H recognized 166 interactions and validated only three interactions out of the 145 interactions collected in the two databases [23,24]. Of the two datasets, the former dataset [9] is processed and named as S1 pos in this work, and the latter dataset [23,24] is named as S2 pos . Through random sampling we obtain the corresponding negative datasets S1 neg , S2 neg for the two positive datasets S1 pos and S2 pos , respectively. Thus we obtain two training datasets: S1 = S1 pos ∪ S1 neg and S2 = S2 pos ∪ S2 neg . To our knowledge, there is no existing computational method for HTLV-human PPI prediction, so we use the HT-Y2H recognition rate of novel PPIs [9] as the baseline performance. To compare with HT-Y2H, we use S1 to train Multi-instance AdaBoost and then check how many interactions out of S2 pos can be correctly recognized. This evaluation is actually an independent test, i.e., S2 pos is used as an independent test set to validate the model that is trained on S1. Before validating S2 pos , we conduct 10-fold cross validation model evaluation on the training data S1.

10-fold cross validation model evaluation
The results of 10-fold cross validation for the three settings on dataset S1 are illustrated in Figures 1, 2 and Table 2. We use the setting Single-instance AdaBoost as the baseline to demonstrate the effectiveness of homolog knowledge transfer by means of independent homolog instances.
Multi-instance AdaBoost versus single-instance AdaBoost. From Figures 1 and 2, we find that Multiinstance AdaBoost significantly outperforms the baseline setting Single-instance AdaBoost, with ROC-AUC 0.8210 versus 0.7655 and PR-AUC 0.7743 versus 0.6971, respectively. From Table 2, we also find that that Multiinstance AdaBoost shows significantly better performance than Single-instance AdaBoost with overall Accuracy 79.03% versus 69.58%. The results suggest that the homolog instances are effective to enrich the feature information and solve the problem of data scarcity. Further details in Table 2 provide additional information about the predictions. For the three settings, the recall rates (sensitivity, SE) of the positive class (interaction) are generally higher than those of the negative class (non-interaction), and conversely the specificity (SP) values of the positive class (interaction) are generally lower than those of the negative class (non-interaction), suggesting that the negative class yields larger misclassification rate than the positive class. To reduce the misclassification rate, we need improve the quality of the sampled negative data. At present, there is no experimentally derived golden-standard non-interaction data, and random sampling is often used as an alternative to obtain the negative data. As we know, random sampling is prone to sample false negative data and thus introduce a certain level of noise. How to sample quality negative data deserves our future study. In this study, random sampling seems to introduce no obvious predictive bias in the three settings from the points of view of the MCC values on the positive class and the negative class, e.g., Multi-instance AdaBoost (0.6498, 0.6397), Multiinstance AdaBoost Novel (0.5611, 0.5416) and Singleinstance AdaBoost (0.5192, 0.5001).

Multi-instance AdaBoost novel versus single-instance
AdaBoost. From Figures 1 and 2, we find that Multiinstance AdaBoost Novel still outperforms the baseline setting Single-instance AdaBoost, with ROC-AUC 0.7846 versus 0.7655 and PR-AUC 0.7521 versus 0.6971, respectively. From Table 2, Multi-instance AdaBoost also shows better performance than Single-instance AdaBoost with overall Accuracy 72.58% versus 69.58%. The results, though not so significant as Multi-instance AdaBoost, still suggest that the homolog instances are effective to substitute for the target instances and thus securely avoid model failure when the gene ontology knowledge is not available.
Multi-instance AdaBoost versus other pathogenhost PPI predictors. We can not conduct direct model comparison because no computational model has been developed for HTLV-human PPI prediction thus far. For rough knowledge about the reliability of Multi-instance AdaBoost, we conduct indirect comparisons with two representative pathogen-host PPI predictive models. One model is the semi-supervised multi-task learning method for HIV-human PPI prediction [11] and the other model is the random forest for Salmonella-human PPI prediction [17]. The model for HIV-human PPI prediction is trained on large data (2,277 interactions) and achieves 0.919 ROC-AUC score, whereas the model for Salmonella-human PPI prediction is trained on rather small data that contains only 66 interactions and achieves 0.52 F1 score. We can see that the size of training data is one of the factors that have large influence on the model performance. Comparatively, the proposed Multi-instance AdaBoost achieves 0.8210 ROC-AUC score and 0.80 F1 score. In terms of training data size, the Multi-instance AdaBoost model trained on 155 interactions is much closer to the Salmonella-human PPI prediction model  (66 interactions) than to the HIV-human PPI prediction model (2,277 interactions). Nevertheless, Multi-instance AdaBoost achieves a significantly higher F1 score than the Salmonella-human PPI prediction model (0.80 versus 0.52). Moreover, Multi-instance AdaBoost achieves at least 0.7419 SE on the positive class, also significantly higher than the Salmonella-human PPI prediction model (SE 0.407). These rough comparisons, though based on different data, suggest that the proposed Multi-instance AdaBoost performs well on small data.

Independent test on the data from recent databases
As mentioned above, the experimental HT-Y2H [9] reproduced only three interactions out of the 145 interactions collected from VirusMINT [23] and VirHostNet [24], accounting for 2.1% recognition rate. The result suggests that HT-Y2H is effective to some specific proteinprotein interactions (e.g. transient interaction) but is prone to yield rather high false negative rate for other types of interaction. Furthermore, not only is the overlap between different experimental results rather small, but also the overlap between the computationally reconstructed network and the experimentally derived network is neither large. As reported in [11], the semi-supervised multi-task learning model validated only 10% HIV-human PPIs derived by siRNA screen. The low network overlap may suggest two points: (1) different experimental techniques should be treated as mutual complements to detect different types of protein-protein interaction, or (2) the computational methods need further improvement to generalize well.
The results of 10-fold cross validation shows that the proposed Multi-instance AdaBoost achieves better performance on small data than other existing pathogenhost PPI predictor [17]. Here we further conduct an independent test to compare with the experimental HT-Y2H [9] by examining how many interactions out of the 145 interactions (S2 pos ) can be correctly recognized by Multi-instance AdaBoost. The independent test is actually a validation on the positive data S2 pos with negligence of the negative data S2 neg , as we are more concerned about the recognition rate of the known PPIs. For this reason, we train Multi-instance AdaBoost on the dataset S1 and use the model to predict S2 pos . Notably, Multi-instance AdaBoost can correctly recognize 102 interactions out of the total 145 interactions (S2 pos ), accounting for 70.3% recognition rate, much larger than HT-Y2H 2.1% recognition rate [9] and 10% overlap between predictions and siRNA screen [11]. The overlap between the networks predicted by Multi-instance AdaBoost and derived by HT-Y2H is given in Additional file 2.

Proteome-wide PPIs prediction and gene ontology based clustering analysis Proteome-wide PPIs prediction
In this section we exploit the PPI data available [9,23,24] to train Multi-instance AdaBoost for proteome-wide HTLVhuman PPI networks reconstruction. Before predictions, we also conduct 10-fold cross validation model evaluation on the whole dataset S. The results are equivalent to the 10-fold cross validation performance on the dataset S1 (see Additional file 1: Figure S1, Figure S2 and Table S1).
In the dataset S, there are 9 HTLV proteins that have corresponding reviewed accessions in the Uniprot database. The human proteins are taken from the file uni-prot_sprot_human.dat.gz available at ftp://ftp.uniprot. org/pub/databases/uniprot/ current_release/knowledgebase/taxonomic_divisions/. After removing those uncurated/putative/uncharacterized proteins and those proteins that are already used as training data, we finally obtain 20,334 human proteins as the candidate targets of the 9 HTLV proteins. Hence there are totally 183,006 (9 × 20,334) protein pairs to be predicted. We use the trained Multiinstance AdaBoost to predict all the 183,006 protein pairs and detect 61,846 novel interactions (see Additional files 2 and 3), accounting for 33.79% predicted positive rate. Among the 20,334 human proteins, there are totally 10,445 human proteins predicted to interact with the 9 HTLV proteins, that's to say, about 50% of the known human proteins are predicted to be potentially targeted by HTLV proteins. The result suggests that the proposed Multi-instance AdaBoost yields a certain degree of false positive predictions. The risk of false positive is a hard problem to both computational modeling and small training data is the major concern of computational modeling.
In this work, we propose Multi-instance AdaBoost to address the problems of data scarcity and data unavailability for proteome-wide HTLV-human PPI networks reconstruction. In this method, the gene ontology knowledge of the homologs is treated as independent homolog instance to augment the training data, so that the feature information is enriched to make up for data scarcity and reduce the risk of model overfitting. Meanwhile, the homolog instances are treated as substitute for the potentially missing target instances to address the problem of data unavailability. However, since the homolog instances are likely to carry a certain level of noise due to evolutionary divergence, we resort to AdaBoost instance reweighting to attenuate the impact of noise. AdaBoost has been theoretically proven to maximize the margin between two-class hyperplanes by penalizing the impact of noise/ outlier. As compared to other existing pathogen-host PPI predictive models [17,18], the proposed Multi-instance AdaBoost has several advantages: (1) the homolog knowledge is used to augment the training data and thus to reduce the risk of model overfitting; (2) the homolog knowledge is used as substitute to address the problem of data unavailability; (3) the noise from homolog knowledge transfer is attenuated by AdaBoost instance reweighting algorithm. Comparatively, a drawback of Multi-instance AdaBoost is that the other feature information except gene ontology is not integrated into the model. We should achieve balance between data constraint and data enrichment in the future research.
To validate the assumptions that the homolog instances are effective to address the problems of data scarcity and data unavailability, we design three experimental settings, i.e. Multi-instance AdaBoost, Multi-instance AdaBoost Novel and Single-instance AdaBoost, and conduct 10-fold cross validation experiments & independent tests for each setting, using multiple performance metrics (SP, SE, Accuracy, MCC, ROC-AUC, PR-AUC). The experimental results demonstrate these points: (1) Multi-instance AdaBoost significantly outperforms the baseline Singleinstance AdaBoost, suggesting that the homolog instances are effective to augment the training data; (2) Multiinstance AdaBoost Novel still outperforms the baseline Figure 3 The predicted HTLV-human PPI sub-network GO:0000187 (biological process: activation of MAPK activity). The green node denotes HTLV protein and the red node denotes human protein.
Single-instance AdaBoost, suggesting that the proposed Multi-instance AdaBoost can still work well when the feature information of the proteins to be predicted is not available; (3) Multi-instance AdaBoost correctly recognize 70.3% of the known PPIs, significantly higher than HT-Y2H 2.1% recognition rate; (4) Indirect comparisons show that Multi-instance AdaBoost outperforms the existing pathogen-host PPI predictive models that were trained on small datasets.
Lastly, we apply Multi-instance AdaBoost to reconstruct the proteome-wide HTLV-human PPI networks and conduct gene ontology based clustering analysis of the predicted networks. The clustering analysis gains much insight into the pathogenesis of HTLV retroviruses and provides valuable clues for further experimental studies.

Conclusion
The computational modeling for pathogen-host PPI networks reconstruction needs to address the major concerns of data scarcity and data unavailability. In this paper, we propose a novel method Multi-instance Ada-Boost to augment the training data. Experimental results show that the homolog knowledge transfer by means of independent homolog instances is effective to enrich the information abundance and to help the model work properly when the feature information is not available. Moreover, the gene ontology based clustering of the proteome-wide predicted HTLV-human PPI networks provides valuable clues for further biomedical research.

Additional files
Additional file 1: Brief description of Modest AdaBoost [26]. Formal definitions of SP, SE, MCC, overall accuracy (Acc) and F1 score. Figure S1 ROC curve for three experimental settings (Multi-instance AdaBoost, Multi-instance AdaBoost Novel, Single-instance AdaBoost) on the dataset S. Figure S2 Precision-Recall curve for three experimental settings (Multi-instance AdaBoost, Multi-instance AdaBoost Novel, Single-instance AdaBoost) on the dataset S. Table S1 10-fold cross validation performance estimation on on the dataset S. Figure S3 The predicted HTLV-human PPI sub-network GO:0006120 (biological process: mitochondrial electron transport, NADH to ubiquinone). The green node denotes HTLV protein and the red node denotes human protein. Figure S4 The predicted HTLV-human PPI sub-network GO:0005344 (molecular function: oxygen transporter activity). The green node denotes HTLV protein and the red node denotes human protein.
Additional file 2: Text file contains the overlapped interactions between Multi-instance AdaBoost and the experimental technique HT-Y2H [9].
Additional file 3: Text file contains the predicted interactions. Figure 4 The predicted HTLV-human PPI sub-network GO:0003743 (molecular function: translation initiation factor activity). The green node denotes HTLV protein and the red node denotes human protein.