Multi-label multi-instance transfer learning for simultaneous reconstruction and cross-talk modeling of multiple human signaling pathways

Background Signaling pathways play important roles in the life processes of cell growth, cell apoptosis and organism development. At present, the known signal transduction networks are far from complete. As an effective complement to experimental methods, computational modeling is suited to rapidly reconstructing signaling pathways at low cost. To our knowledge, the existing computational methods seldom incorporate more than three signaling pathways into one predictive model for the discovery of novel signaling components and the modeling of cross-talk between signaling pathways.
Results In this work, we propose a multi-label multi-instance transfer learning method to simultaneously reconstruct 27 human signaling pathways and model their cross-talks. Computational results show that the proposed method achieves satisfactory multi-label learning performance and rational proteome-wide predictions. Some predicted signaling components or pathway-targeted proteins have been validated by recent literature. The predicted signaling components are further linked to pathways using experimentally derived PPIs (protein-protein interactions) to reconstruct the human signaling pathways. The map of the cross-talks via common signaling components and common signaling PPIs is then conveniently inferred to provide valuable insights into the regulatory and cooperative relationships between signaling pathways. Lastly, gene ontology enrichment analysis is conducted to gain statistical knowledge about the reconstructed human signaling pathways.
Conclusions The multi-label learning framework has been demonstrated in this work to be effective for modeling the phenomenon that a signaling protein belongs to more than one signaling pathway. As a result, novel signaling components and pathway-targeted proteins are predicted to simultaneously reconstruct multiple human signaling pathways and the static map of their cross-talks for further biomedical research.
Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0841-4) contains supplementary material, which is available to authorized users.


Background
Signaling pathways play important roles in the life processes of cell growth, differentiation and apoptosis. Stimuli from the extracellular environment and cellular matrix are sensed, amplified and transduced to the nucleus via signaling pathways to yield complex biological responses (e.g. enzyme activity, transcription factor activation/deactivation, gene expression, ion-channel activity, etc.) [1]. Malfunction of signaling pathways is likely to lead to a variety of pathologies [2].
Protein-protein interaction (PPI) networks play fundamental roles in the study of signal transduction, because extracellular signals are generally transmitted from membrane to nucleus via a series of PPIs and molecular modifications. Thus reconstruction of PPI networks, by both experimental techniques [2,3] and computational modeling [4][5][6][7][8], has attracted much attention in recent years. At present, the existing computational methods for reconstruction of signaling pathways mainly rely on the shortest path algorithm [9][10][11] and the message-passing algorithm [12]. For instance, Tuncbag et al. [12] used a message-passing algorithm to derive a directed forest from PPI networks, from which signaling pathways are inferred. These methods are simple and have the least demanding data requirements, in that only the PPI network topology is needed. Besides, the method in [12] used a confidence-weighted interactome to explicitly counteract the noise of PPI network topology, so that the risk of false negatives and false positives is reduced. Nevertheless, methods based on PPI network topology need to be further improved with regard to two concerns: (1) signaling pathways possibly contain feedback loops, which make the shortest path algorithm inaccurate so that it yields false signaling components; (2) the experimental data of signaling components should be exploited to guide the search for novel signaling components in PPI networks.
By comparison, machine learning methods can effectively exploit multiple experimental data of signaling pathways simultaneously, without prior knowledge about the underlying biochemical mechanism [13][14][15][16]. Reconstruction of signaling pathways can be decomposed into two steps: the first step is the recognition of signaling components, and the second step is to link the predicted signaling components to the existing signaling pathways via experimental PPIs or predicted PPIs. The existing machine learning methods focus on the discovery of novel signaling components [14][15][16]. In [14], a multi-class SVM model is trained using protein domain features to predict novel signaling components. In [15], the data of experimentally verified signaling components are used to train an SVM model for the prediction of homologous signaling pathways. In [16], the ortholog pairs of known interacting signaling components from known signaling pathways, called signalogs, are directly treated as signaling PPIs and then used to construct homologous signaling pathways. In fact, computational reconstruction of signaling pathways, as a complex problem, needs to address three major concerns: (1) discovery of novel signaling components, especially those that belong to more than one signaling pathway; (2) linking the predicted signaling components to signaling pathways via signaling PPIs; (3) cross-talk modeling between signaling pathways. To our knowledge, the existing computational methods seldom explicitly address all three concerns. Recently, cross-talk modeling between signaling pathways has attracted much attention. For instance, a graph search method is used to find the common cross-talk signaling components between three signaling pathways (EGFR, IGF-1R and IR) [17], and the PRISM modeling language is used to formally describe the common modules between signaling pathways [18]. Unfortunately, these methods are generally descriptive and can neither predict novel signaling components nor model signaling cross-talks.
In this work, we propose a multi-label multi-instance transfer learning method to simultaneously reconstruct multiple human signaling pathways and model their cross-talks. In this method, the data of the known signaling components from 27 human signaling pathways are used to train a 28-class multi-label SVM (support vector machine) model, wherein the 28th class contains the negative data that are randomly sampled from the proteins that do not belong to the 27 signaling pathways. The scenario in which a signaling component is shared by, or belongs to, multiple signaling pathways is modeled under the multi-label learning framework. To enrich the knowledge about each protein, the protein is depicted with two instances: one instance, called the target instance, is represented with its own gene ontology annotations, and the other instance, called the homolog instance, is represented with the gene ontology annotations of its homologs. Besides enriching the target instance, the homolog instance is especially useful as a substitute for the target instance when the protein concerned has no annotations at all. Unlike traditional supervised learning, the evaluation of a multi-label learning model is conducted using three performance metrics, namely exact match ratio, micro-average F-measure and macro-average F-measure. To evaluate the reliability of the proposed model, we validate the proteome-wide predictions against recent literature and conduct cross validation on the training data. Then we link the predicted signaling components to signaling pathways via experimental PPIs and derive the cross-talks between the 27 human signaling pathways to provide valuable cues for further biomedical research.

Human signaling pathways
To date there are several major signaling pathway databases for free academic use, e.g. KEGG (Kyoto Encyclopedia of Genes and Genomes) [19], Reactome [20], SPAD (Signaling Pathway Database) [21], NetPath [22], SignaLink [23], etc. In this work, we choose NetPath (http://www.netpath.org/) to construct the training data for two reasons: (1) NetPath manually curates 35 human cancer/immune signaling pathways and is, to our knowledge, the largest repository of human cancer signaling pathways at present; (2) the signaling components explicitly provided by NetPath are conveniently treated as the training data. KEGG is rather small and contains a limited number of human signaling pathways. The other databases such as Reactome and SignaLink are updated in a timely manner, but likewise collect a very limited number of human cancer signaling pathways and so do not directly serve our purpose. We incorporate the closely related cancerogenic signaling pathways into a single model to facilitate effective knowledge sharing. To date NetPath has manually curated 35 human immune/cancer signaling pathways that contain 11 sub-types of Interleukin (IL-1~IL-11). For simplicity, IL-1~IL-11 are merged into one single class, thus we obtain 27 human signaling pathways as shown in Table 1. The signaling components provided on the website (http://www.netpath.org/) are directly used as training data, and the training data are further validated against the SwissProt database [24] and the GOA database [25] to remove proteins that are not manually curated and proteins that have an empty set of gene ontology annotations. The number of signaling components of each signaling pathway is shown in Table 1.
In general, signaling pathways temporally and spatially communicate via common signaling components and common signaling PPIs. Taking the experimental NetPath database as an example, the EGFR signaling pathway shares 108 common signaling components with the Interleukin signaling pathway and 106 common signaling components with the TCR signaling pathway. To measure the relatedness of any two signaling pathways, we define two cross-talk ratios: the cross-talk ratio of signaling components ($CTR_{SC}$) and the cross-talk ratio of signaling PPIs ($CTR_{SPPI}$). Let $A_{SC}$ and $B_{SC}$ denote the sets of signaling components of two signaling pathways A and B; then $CTR_{SC}$ is defined as $CTR_{SC} = |A_{SC} \cap B_{SC}| / |A_{SC} \cup B_{SC}|$, where $|A|$ denotes the cardinality of set A. $CTR_{SC}$ is thus the ratio of the overlap between set $A_{SC}$ and set $B_{SC}$ to their union. The cross-talk ratio of signaling components ($CTR_{SC}$) derived from the experimental NetPath database is illustrated in Fig. 1(a). We see that there generally are
Similarly, let $A_{SPPI}$ and $B_{SPPI}$ denote the sets of signaling PPIs of two signaling pathways A and B. We define the cross-talk ratio of signaling PPIs as $CTR_{SPPI} = |A_{SPPI} \cap B_{SPPI}| / |A_{SPPI} \cup B_{SPPI}|$ and derive $CTR_{SPPI}$ from the experimental NetPath database as illustrated in Fig. 1(b). By comparison, $CTR_{SPPI}$ is generally much lower than $CTR_{SC}$, implying that a signaling pathway depends more on cross-talk signaling components than on cross-talk signaling PPIs to stimulate or communicate with other signaling pathways. Taking TCR as an example again, TCR seems to be more correlated with BCR ($CTR_{SPPI}$ = 4.3 %).
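The two cross-talk ratios defined above can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the toy pathways and the protein identifiers are hypothetical, and treating a PPI as an unordered pair of proteins is an assumption that the text does not state.

```python
def cross_talk_ratio(set_a, set_b):
    """Return |A ∩ B| / |A ∪ B| as a fraction in [0, 1] (0.0 if both sets are empty)."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def as_ppi_set(pairs):
    """Assumption: a PPI is undirected, so (a, b) and (b, a) are the same interaction."""
    return {frozenset(pair) for pair in pairs}


# Hypothetical toy pathways (not NetPath data).
components_a = {"P1", "P2", "P3", "P4"}
components_b = {"P3", "P4", "P5"}
ctr_sc = cross_talk_ratio(components_a, components_b)  # 2 shared out of 5 in the union

ppis_a = as_ppi_set([("P1", "P2"), ("P3", "P4")])
ppis_b = as_ppi_set([("P4", "P3"), ("P4", "P5")])
ctr_sppi = cross_talk_ratio(ppis_a, ppis_b)  # 1 shared out of 3 in the union

print(f"CTR_SC = {100 * ctr_sc:.1f} %, CTR_SPPI = {100 * ctr_sppi:.1f} %")
```

Multiplying the returned fraction by 100 gives the percentage scale used in Fig. 1.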
Fig. 1 a Matrix plots the cross-talk ratio (%) of signaling components between the experimental human signaling pathways. b Matrix plots the cross-talk ratio (%) of signaling PPIs between the experimental human signaling pathways. c Matrix plots the cross-talk ratio (%) of signaling components between the reconstructed human signaling pathways. d Matrix plots the cross-talk ratio (%) of signaling PPIs between the reconstructed human signaling pathways. The values along the diagonals are trivial and the color bar is used to highlight the magnitude of cross-talks between two human signaling pathways
Using the obtained signaling components, we can easily train a predictive model to assign an unseen protein to one or more signaling pathways. To handle the case in which a protein does not belong to any of the 27 signaling pathways, we need to construct a negative class for the completeness of classification. The negative class, named others, contains the proteins that either are not signaling proteins or do not belong to the 27 signaling pathways. The data of the class others are randomly sampled from the proteins that do not belong to the 27 signaling pathways. Because the space of the class others is very large, we restrict its size to that of the class containing the maximum number of signaling components, in order to reduce the risk of model bias.
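The construction of the negative class described above can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the function name, the fixed random seed and the toy identifiers are hypothetical, and uniform sampling without replacement is assumed because the text only says the negatives are randomly sampled.

```python
import random


def sample_negative_class(all_proteins, pathways, seed=0):
    """Sample the class 'others' from proteins outside every pathway.

    The sample size equals the size of the largest pathway, following the text.
    `seed` is a hypothetical parameter added for reproducibility.
    """
    positives = set().union(*pathways.values())
    candidates = sorted(set(all_proteins) - positives)  # sorted for reproducibility
    target_size = max(len(members) for members in pathways.values())
    rng = random.Random(seed)
    return rng.sample(candidates, min(target_size, len(candidates)))


# Hypothetical toy data: two pathways and a small proteome.
pathways = {"A": {"p1", "p2", "p3"}, "B": {"p3", "p4"}}
proteome = [f"p{i}" for i in range(1, 11)]
others = sample_negative_class(proteome, pathways)
print(others)
```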

Multi-label multi-instance transfer learning

Transfer learning
Transfer learning has been proven effective in knowledge/information transfer across related but heterogeneous domains [26]. In recent years, transfer learning, sometimes in the form of multi-task learning, has found many applications in computational biology [4,5,27,28,29,30]. Knowledge transfer is generally conducted via model parameter optimization [27] or evolutionary homologs [4,5,28,29,30]. Compared with methods based on objective function optimization, homolog knowledge transfer is easy to interpret biologically and is robust against data unavailability. The machine learning frameworks that are adopted to implement knowledge transfer include ensemble learning [4,30], multi-instance learning [5], semi-supervised learning [27] and multi-kernel learning [28,29].
In this work, we use the multi-label learning framework to model the phenomenon that a signaling protein belongs to more than one signaling pathway, and use the multi-instance learning framework to implement homolog knowledge transfer. Each protein is represented with two instances: one instance, called the target instance, represents the GO feature information of the protein itself, and the other instance, called the homolog instance, represents the GO feature information of its homologs. The two instances are treated independently to augment the training data. AdaBoost has been used in the multi-instance learning framework [5], but here we adopt a multi-label SVM (support vector machine) as the base classifier instead, in that SVM is more efficient at handling large data [31].
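The two-instance representation can be illustrated with a short Python sketch. It rests on assumptions that are not taken from the text: GO features are encoded as binary indicator vectors over the GO terms seen in the training data, an instance without any GO term is skipped, and all function names, protein names and label assignments are hypothetical.

```python
def build_vocabulary(*go_maps):
    """Collect every GO term seen in any instance, in a fixed sorted order."""
    terms = set()
    for go_map in go_maps:
        for go_terms in go_map.values():
            terms.update(go_terms)
    return sorted(terms)


def to_vector(go_terms, vocabulary):
    """Assumed encoding: 1 if the GO term annotates the instance, else 0."""
    present = set(go_terms)
    return [1 if term in present else 0 for term in vocabulary]


def make_instances(target_go, homolog_go, labels):
    """Emit the target and homolog instances of each protein as independent rows."""
    vocabulary = build_vocabulary(target_go, homolog_go)
    features, label_sets, kinds = [], [], []
    for protein in sorted(labels):
        for kind, go_map in (("target", target_go), ("homolog", homolog_go)):
            go_terms = go_map.get(protein, ())
            if go_terms:  # an instance with no annotation carries no information
                features.append(to_vector(go_terms, vocabulary))
                label_sets.append(frozenset(labels[protein]))
                kinds.append((protein, kind))
    return vocabulary, features, label_sets, kinds


# Hypothetical proteins; protB has no annotation of its own, only homolog annotations.
target_go = {"protA": {"GO:0005515", "GO:0046872"}, "protB": set()}
homolog_go = {"protA": {"GO:0005515"}, "protB": {"GO:0004842"}}
labels = {"protA": {"Notch", "TCR"}, "protB": {"Notch"}}

vocabulary, features, label_sets, kinds = make_instances(target_go, homolog_go, labels)
for kind, row in zip(kinds, features):
    print(kind, row)
```

In this toy case protB contributes only a homolog instance, which mirrors the role of the homolog instance as a substitute when the protein itself is not annotated.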

Multi-instance feature construction
Here each protein is represented with two instances, i.e. the target instance and the homolog instance. The homolog instance is constructed using the GO terms of the homologs, which are extracted from the SwissProt database [24] using PSI-BLAST [32] (E-value = 10) against all species. The GO terms are extracted from the GOA database [25]. Using U to denote the training set, we obtain two sets of GO terms for each protein i: the homolog set $S_i^H$, which contains the GO terms of the homologs, and the target set $S_i^T$, which contains the GO terms of the protein itself. The set of GO terms of the training set U is then defined over the target sets and homolog sets of all proteins in U, and, based on these notations, the target instance and the homolog instance of each protein are formally defined from $S_i^T$ and $S_i^H$, respectively.

Multi-label learning for modeling cross-talks between signaling pathways
As illustrated in Fig. 1(a), most human signaling pathways share common signaling proteins. From the point of view of machine learning, the phenomenon that one protein belongs to multiple signaling pathways is suited to be modeled by multi-label learning. At present there are two approaches to converting multi-label learning into traditional single-label learning: the label combination method and the binary method [33]. The label combination method assigns a new label encoding to each label combination that occurs in the training data, e.g. the label combination {1, 2, 3} is encoded as {1}, the label combination {1, 4} is encoded as {2}, etc. The binary method trains one binary classifier for each class label, treating as positive the data associated with the class label and as negative the data associated with all the other class labels. Here we choose the label combination method because it trains only one classifier for an n-class problem, while the binary method needs to train n binary classifiers.
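The label combination method can be sketched as below. This is only an illustrative sketch: the function names are hypothetical, and numbering the combinations in order of first appearance is an assumption chosen so that the output matches the example encodings given above.

```python
def encode_label_combinations(label_sets):
    """Map each distinct label combination to one new class id (1, 2, 3, ...)."""
    mapping = {}
    encoded = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in mapping:
            mapping[key] = len(mapping) + 1  # ids follow order of first appearance
        encoded.append(mapping[key])
    return encoded, mapping


def decode_label_combination(class_id, mapping):
    """Recover the original label set from a predicted class id."""
    inverse = {value: key for key, value in mapping.items()}
    return set(inverse[class_id])


training_labels = [{1, 2, 3}, {1, 4}, {1, 2, 3}, {2}]
encoded, mapping = encode_label_combinations(training_labels)
print(encoded)
print(decode_label_combination(2, mapping))
```

A single multi-class classifier is then trained on the encoded ids, and each predicted id is decoded back into its label set. One consequence of this design is that only label combinations seen in the training data can ever be predicted.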
Compared with traditional supervised learning, the performance estimation of multi-label learning is more complicated. In the traditional learning scenario, the standard evaluation criterion is accuracy. In the multi-label learning scenario, a direct extension of accuracy is the exact match ratio, which regards a prediction as correct if and only if all the associated class labels are correctly predicted. However, the exact match ratio does not count partial matches, which are also significant for expanding our knowledge. To take partial matches into account, we adopt the macro-average F-measure and the micro-average F-measure [33] as multi-label learning performance metrics. Assume there are $l$ testing instances, let $y_i$ denote the true label vector of the ith instance and let $\hat{y}_i$ denote the predicted label vector; then the exact match ratio is defined as follows:
\[ \text{Exact match ratio} = \frac{1}{l} \sum_{i=1}^{l} I[y_i = \hat{y}_i] \]
where $I$ denotes the indicator function as defined below:
\[ I[s] = \begin{cases} 1, & \text{if } s \text{ is true} \\ 0, & \text{otherwise} \end{cases} \]
The above definition of exact match ratio means that a prediction is viewed as correct if and only if all the labels of a protein are correctly recognized. It is easy to see that this definition is too strict to take partial matches into account, although partial match predictions are also valuable. Assuming that a protein is labeled with the label set {1, 2, 4}, the prediction cannot be simply deemed incorrect if the protein is predicted to have the label subset {1, 2}, because the partial match still provides valuable cues. For this reason, a proper performance metric for multi-label learning should take partial matches into account.
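Assume that the total label set is $L = \{1, 2, 3, \ldots, d\}$. For the ith instance, the true label set is denoted as $L_i$ and the predicted label set is denoted as $\hat{L}_i$. Then a set of $d$ binary values is used to formally define the true label and the predicted label for the ith instance as follows:
\[ y_i^j = \begin{cases} 1, & j \in L_i \\ 0, & \text{otherwise} \end{cases} \qquad \hat{y}_i^j = \begin{cases} 1, & j \in \hat{L}_i \\ 0, & \text{otherwise} \end{cases} \]
For label $j$, the performance metrics precision (P) and recall (R) are defined as follows:
\[ P_j = \frac{\sum_{i=1}^{l} y_i^j \hat{y}_i^j}{\sum_{i=1}^{l} \hat{y}_i^j}, \qquad R_j = \frac{\sum_{i=1}^{l} y_i^j \hat{y}_i^j}{\sum_{i=1}^{l} y_i^j} \]
Since the F-measure is defined as $F = 2 \times P \times R / (P + R)$, the F-measure for label $j$ is formally defined as follows:
\[ F_j = \frac{2 \sum_{i=1}^{l} y_i^j \hat{y}_i^j}{\sum_{i=1}^{l} y_i^j + \sum_{i=1}^{l} \hat{y}_i^j} \]
The macro-average F-measure is defined as the unweighted mean of the F-measures of all class labels:
\[ F_{macro} = \frac{1}{d} \sum_{j=1}^{d} F_j \]
The micro-average F-measure considers the predictions from all instances and calculates the F-measure across all class labels as follows:
\[ F_{micro} = \frac{2 \sum_{j=1}^{d} \sum_{i=1}^{l} y_i^j \hat{y}_i^j}{\sum_{j=1}^{d} \sum_{i=1}^{l} y_i^j + \sum_{j=1}^{d} \sum_{i=1}^{l} \hat{y}_i^j} \]
Both the macro-average F-measure and the micro-average F-measure take partial matches into account. In this work, we use the target instances and the homolog instances separately to estimate the exact match ratio, the macro-average F-measure and the micro-average F-measure. The performance metrics are derived using the Gaussian kernel:
\[ K(x, x') = \exp\left(-\gamma \, \|x - x'\|^2\right) \]
where $\|\Delta\|$ denotes the 2-norm of a vector $\Delta$ and the hyperparameter $\gamma$ controls the flexibility of the kernel.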
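As a worked illustration of the three metrics, the following Python sketch computes them on a toy example in which the first instance reproduces the partial match {1, 2, 4} versus {1, 2} discussed above. The function names and the toy predictions are hypothetical, the formulas follow the standard definitions of these metrics, and returning 0 for a label that is neither present nor predicted is an assumption.

```python
def exact_match_ratio(true_sets, pred_sets):
    """Fraction of instances whose predicted label set equals the true label set."""
    matches = sum(1 for t, p in zip(true_sets, pred_sets) if set(t) == set(p))
    return matches / len(true_sets)


def label_counts(true_sets, pred_sets, label):
    """Return (correctly predicted, truly labeled, predicted) counts for one label."""
    hits = sum(1 for t, p in zip(true_sets, pred_sets) if label in t and label in p)
    n_true = sum(1 for t in true_sets if label in t)
    n_pred = sum(1 for p in pred_sets if label in p)
    return hits, n_true, n_pred


def macro_f_measure(true_sets, pred_sets, labels):
    """Unweighted mean of the per-label F-measures 2*hits / (n_true + n_pred)."""
    scores = []
    for label in labels:
        hits, n_true, n_pred = label_counts(true_sets, pred_sets, label)
        denom = n_true + n_pred
        scores.append(2 * hits / denom if denom else 0.0)  # assumption: 0 if unused
    return sum(scores) / len(scores)


def micro_f_measure(true_sets, pred_sets, labels):
    """F-measure computed from counts pooled over all labels and instances."""
    hits_sum = true_sum = pred_sum = 0
    for label in labels:
        hits, n_true, n_pred = label_counts(true_sets, pred_sets, label)
        hits_sum += hits
        true_sum += n_true
        pred_sum += n_pred
    denom = true_sum + pred_sum
    return 2 * hits_sum / denom if denom else 0.0


labels = [1, 2, 3, 4]
true_sets = [{1, 2, 4}, {3}, {1}]
pred_sets = [{1, 2}, {3}, {2}]

emr = exact_match_ratio(true_sets, pred_sets)
macro_f = macro_f_measure(true_sets, pred_sets, labels)
micro_f = micro_f_measure(true_sets, pred_sets, labels)
print(f"exact match = {emr:.4f}, macro F = {macro_f:.4f}, micro F = {micro_f:.4f}")
```

Only the second instance is an exact match, so the exact match ratio is 1/3, whereas the macro-average (7/12) and micro-average (2/3) F-measures give credit for the two labels recovered in the first instance.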

Performance estimation by 10-fold cross validation
The proposed multi-label multi-instance transfer learning model is evaluated by 10-fold cross validation to derive the exact match ratio, macro-average F-measure and micro-average F-measure. In the multi-instance learning scenario, each data point is represented with multiple instances, so multiple predicted outcomes are yielded for each test data point in the test or prediction phase. The outcomes are easy to combine into one single outcome in the single-label learning scenario [5], but outcome combination is not easy in the multi-label learning scenario. We therefore report the predicted outcomes of the target instance and the predicted outcomes of the homolog instance separately. In the training phase, both the target instances and the homolog instances participate in model training.
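The fold construction for such an evaluation can be sketched as follows. The text does not say how the folds are drawn, so splitting at the protein level (which keeps the target and homolog instances of one protein in the same fold) is an assumption, as are the function name, the seed and the toy identifiers.

```python
import random


def k_fold_protein_splits(proteins, k=10, seed=0):
    """Yield (train_proteins, test_proteins) pairs for k-fold cross validation.

    Assumption: folds are drawn over proteins, not over individual instances.
    """
    ids = sorted(proteins)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]  # k folds of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


proteins = [f"p{i}" for i in range(25)]  # hypothetical protein identifiers
splits = list(k_fold_protein_splits(proteins, k=10))
for train, test in splits[:2]:
    # In each round, both instance types of the training proteins would be used
    # for training; target and homolog instances of the test proteins would then
    # be predicted and scored separately.
    print(len(train), len(test))
```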
The computational results are given in Table 2. From Table 2, we can see that the proposed method achieves a promising exact match ratio (target instance: 0.7558; homolog instance: 0.7055), which means that over 70 % of the proteins have their complete label sets correctly recognized. Although the exact match ratios are moderate, the results are fairly satisfactory, because recognizing the complete label set is a hard task. The exact match ratio of the homolog instance, though slightly lower than that of the target instance, suggests that the homolog knowledge is useful for the study of novel proteins we know little about. The slight decrease in exact match ratio is partly because the homolog instance carries a certain level of noise that results from evolutionary divergence. When partial matches are taken into account, the proposed model achieves a high macro-average F-measure (target instance: 0.9555; homolog instance: 0.9267) and micro-average F-measure (target instance: 0.9505; homolog instance: 0.9146). Here the performance difference between the homolog instance and the target instance is smaller, again demonstrating the feasibility of homolog knowledge transfer by means of an independent homolog instance.
To further estimate the multi-label learning performance, we calculate the F-measure for each class (see Fig. 2). As illustrated in Fig. 2, the proposed method achieves an F-measure of over 0.9 for most classes. On four classes (AR, EGFR, Hedgehog, TSLP), the F-measure is between 0.8 and 0.9. On the class others, the F-measure is unsatisfactory at about 0.5, partly because of the quality of the randomly sampled data. Fortunately, the proposed model achieves sound performance on the 27 human signaling pathways, implying that the misclassifications on the class others have little adverse effect on the 27 signaling pathways. In addition, the performance difference between the homolog instance and the target instance is fairly small (see Fig. 2), suggesting that the predicted outcomes of the homolog instances are equally valuable.

Simultaneous reconstruction of multiple human signaling pathways and their cross-talks modeling

Predicting novel signaling components
Recognition of novel signaling components from proteome-wide candidate proteins is the first step of signaling pathway reconstruction. We extract the candidate proteins from the SwissProt database [24] and remove those proteins that have been included in the training data and those proteins that have neither target GO annotations nor homolog GO annotations. Thus we obtain 13,004 candidate proteins in total. The proteome-wide predictions are given in Additional file 1 and the number of predicted signaling components for each signaling pathway is given in Table 1. The details of the predicted signaling components for each signaling pathway are given in Additional file 2 (target instance) and Additional file 3 (homolog instance). The computational results show that many proteins are predicted to belong to more than one signaling pathway. From Table 1, we can see that the predicted label set of the target instance is much smaller than the predicted label set of the homolog instance and that the intersection between the two label sets is not large. The results are largely attributed to the fact that the target instance is generally less enriched in GO annotations, while the homolog instance is more enriched in GO annotations but carries a certain level of noise.

Linking predicted signaling components to pathways
Signaling proteins generally do not work in isolation but transmit signals via interaction with other proteins or biological molecules. The predicted signaling components need to be linked to the current human signaling pathways via experimental or predicted protein-protein interactions. For the sake of reliability, we use the experimental PPIs from the HPRD database [34] to link the predicted signaling components. Once a predicted signaling component is linked to a signaling pathway, the corresponding PPI becomes a novel signaling PPI of the signaling pathway. Here a novel signaling PPI does not mean that the PPI is newly predicted, but that the PPI is newly treated as a part of the signaling pathway. From the HPRD database, we obtain two kinds of signaling PPIs: (1) the PPIs between the predicted signaling components and the known signaling components; (2) the PPIs between the predicted signaling components. The derived signaling PPIs are given in Additional file 4 (target instance) and Additional file 5 (homolog instance). The number of novel signaling PPIs for each signaling pathway is shown in Table 1. As examples, we only illustrate the Notch, TGF-β and TNF-α signaling pathways that are predicted by the target instances (see Figs. 3, 4 and 5). As shown in Fig. 3, the predicted signaling components (nodes in red) elongate the existing Notch signaling pathway and form several triangle loops or protein complexes. The signaling pathway is largely elongated at the nodes {RING1, HDAC1, HDAC2, SIN3A, HES1}. At the node RING1, the predicted signaling components form several loops, for instance, a triangle loop {RING1, E2F6, RYBP}.
In [35], experimental results demonstrate that E2F6 is a component of the mammalian polycomb complex that interacts with the polycomb group proteins {RYBP, RING1} to play a key role in the regulation of cellular proliferation and terminal differentiation. Centring around RING1, the predicted signaling components {RYBP, E2F6, CBX2, CBX4, CBX8, RNF2, PCGF2} of polycomb complex play important roles in modifying chromatin structure to regulate transcriptional activities, and communicate with the central transcriptional regulator of in Notch signaling via FHL1.
The common topological feature between the extended TGF-β signaling pathway (Fig. 4) and the extended TNF-α signaling pathway (Fig. 5) is that the predicted signaling components generally act as terminal proteins/peripheral proteins, or interact with the peripheral proteins of the existing signaling pathways to form redundant paths or loops. Take the peripheral protein NEDD4L of TGF-β signaling pathway as example (the upper peripheral of Fig. 4), the predicted signaling components {UBE2E1, CNOT4, CNOT8} elongate the TGFβ pathway, wherein CNOT4 has been experimentally demonstrated to activate the JAK/STAT pathway [36]. It can be inferred that CNOT4 acts as a cross-talk signaling component between TGF-β and JAK/STAT signaling pathways.
As illustrated in Fig. 5, the reconstructed TNF-α signaling pathway shows obvious modularity. The predicted signaling components are peripherally distributed and interact with the peripheral proteins of the existing TNF-α signaling pathway, and the interactions between the predicted signaling components elongate the TNF-α pathway with many redundant paths or loops. Taking the peripheral protein COPB2 of the TNF-α signaling pathway as an example (the upper periphery of Fig. 5), the predicted signaling components {COPA, COPE, COPG2, COPZ2, COPZ1, TAPBP, ARCN1, COPB1} interact with each other and link to the existing signaling component COPB2 via COPA. Moreover, the small motif {COPA, COPE, COPG2, COPZ2, COPZ1, TAPBP, ARCN1, COPB1} also links to the existing signaling component PRKCE (near the core of Fig. 5) via COPB1. The redundant paths help to enhance the robustness of the TNF-α signaling pathway. The extended Notch, TGF-β and TNF-α signaling pathways predicted by the homolog instances are given in Additional file 6: Figure S1, Additional file 7: Figure S2 and Additional file 8: Figure S3. Interested readers are referred to Additional files 4 and 5 for other human signaling pathways.
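The linking step of this subsection, which derives the two kinds of novel signaling PPIs from an experimental PPI list, can be sketched in Python as follows. The function name and the proteins X1 and X2 are hypothetical. The three interactions among RING1, E2F6 and RYBP are taken from the triangle loop described for the Notch pathway, with RING1 as the existing node, and treating PPIs as unordered pairs is an assumption.

```python
def derive_signaling_ppis(ppis, known, predicted):
    """Split experimental PPIs into the two kinds of novel signaling PPIs.

    Returns (predicted-known PPIs, predicted-predicted PPIs) as sets of
    unordered pairs; PPIs touching no predicted component are ignored.
    """
    known, predicted = set(known), set(predicted)
    predicted_known, predicted_predicted = set(), set()
    for a, b in ppis:
        if a == b:
            continue  # self-interactions cannot link two components
        pair = frozenset((a, b))
        if a in predicted and b in predicted:
            predicted_predicted.add(pair)
        elif (a in predicted and b in known) or (b in predicted and a in known):
            predicted_known.add(pair)
    return predicted_known, predicted_predicted


# Triangle loop {RING1, E2F6, RYBP} from the Notch example; X1 and X2 are
# hypothetical proteins outside the pathway.
experimental_ppis = [("RING1", "E2F6"), ("RYBP", "RING1"), ("E2F6", "RYBP"), ("X1", "X2")]
known_components = {"RING1"}
predicted_components = {"E2F6", "RYBP"}

predicted_known, predicted_predicted = derive_signaling_ppis(
    experimental_ppis, known_components, predicted_components
)
print(sorted(sorted(pair) for pair in predicted_known))
print(sorted(sorted(pair) for pair in predicted_predicted))
```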

Modeling signaling cross-talks
Cross-talk modeling is instrumental in studying the regulatory and cooperative relationships between signaling pathways, on the basis of which the pathogenesis of diseases can be further revealed [37]. Signaling pathways generally communicate with each other via common signaling components and common signaling PPIs. For simplicity, we investigate here the static map of cross-talks only and do not discuss the temporal and spatial cross-talk mechanism. The details of the predicted common signaling components are given in Additional file 9 (target instance) and Additional file 10 (homolog instance). The experimental signaling components and the predicted signaling components are merged to derive the cross-talk ratio of signaling components $CTR_{SC}$ as illustrated in Fig. 1(c) (target instance) and Additional file 11: Figure S4 (homolog instance). Comparing Fig. 1(c) and Fig. 1(a), we can see that the cross-talk ratio $CTR_{SC}$ of the reconstructed signaling pathways is much lower than that of the experimental signaling pathways, in that the number of predicted novel cross-talk signaling components increases much more slowly than the number of predicted novel signaling components. From Fig. 1(c), TCR still significantly correlates with BCR ($CTR_{SC}$ = 9.2) and EGFR ($CTR_{SC}$ = 11.5). The details of the common signaling PPIs derived from the HPRD database are given in Additional file 12 (target instance) and Additional file 13 (homolog instance). Similarly, the experimental signaling PPIs and the predicted signaling PPIs are merged to derive the cross-talk ratio of signaling PPIs $CTR_{SPPI}$ as illustrated in Fig. 1(d) (target instance) and Additional file 14: Figure S5 (homolog instance).
The static map of cross-talks between the TGF-β signaling pathway and the TNF-α signaling pathway (target instance) is illustrated in Fig. 6, where the color green denotes TGF-β signaling components and signaling PPIs, the color blue denotes TNF-α signaling components and signaling PPIs, and the color red denotes the cross-talk signaling components and the cross-talk signaling PPIs. There are 52 cross-talk signaling components and 6 cross-talk signaling PPIs between the TGF-β signaling pathway and the TNF-α signaling pathway, of which 6 cross-talk signaling components and the 6 cross-talk signaling PPIs are predicted. From Fig. 6, we can see that most of the

Literature and KEGG validation
We further validate the proteome-wide predictions against recent literature and signaling pathway databases. Since the data we are concerned with are scarce and sparsely scattered across hundreds of publications, it is hard to collect sufficient evidence to validate the predictions. Nevertheless, we still find dozens of pieces of supporting evidence, as shown in Table 3. For instance, four pieces of evidence are found for the Notch signaling pathway. For the predicted signaling components or targets {RNF2, RING1B}, [38] has experimentally demonstrated that the Polycomb protein Ring1B promotes the proliferation and self-renewal of embryonic neural stem/progenitor cells by repressing cell cycle inhibitors and maintaining the Notch signaling pathway. For the predicted signaling component TBL1XR1, [39] has experimentally demonstrated that TBL1XR1 acts as a key player in the regulation of multiple signaling pathways (Wnt/β-catenin, Notch, NF-κB, and nuclear receptor) and gene transcription. For POGLUT1, [40] has demonstrated that POGLUT1 is a part of the Notch signaling pathway that encodes protein O-glucosyltransferase 1 and is involved in posttranslational modification of Notch proteins. For the predicted signaling component SNX27 of the TCR signaling pathway, [41] has experimentally shown that SNX27 is a PDZ-containing component of the T cell immunological synapse and that SNX27-positive endosomes polarise to the immunological synapse in response to TCR activation. As for the TGF-β signaling pathway, the proteins {RBPMS, BMP15} are predicted to be signaling components. [42] shows that RBPMS interacts with TGF-β receptor type I (TbR-I), increases phosphorylation of the C-terminal SSXS regions in Smad2 and Smad3, and promotes the nuclear accumulation of the Smad proteins. [43] shows that BMP15 is a closely related TGF-β ligand that is implicated as a key regulator of follicle development and fertility. As for the TNF-α signaling pathway, [44] experimentally demonstrates that knock-down of the TNFα-induced protein TNFAIP8 in tumor cells decreases their oncogenicity, which suggests that TNFAIP8 may be involved in carcinogenesis. As for the WNT signaling pathway, [45] shows that the interaction between Hhex and SOX13 modulates Wnt/TCF pathway activity, and that the interaction between SOX13 and TCF1 represses Wnt/TCF signaling. As for the BCR signaling pathway, the Ingenuity Pathways Analysis shows that MAP3K12 is involved in the BCR signaling pathway and PIK3C3 is involved in the Prolactin signaling pathway [46]. As for the Hedgehog signaling pathway, [47] has experimentally demonstrated that MED12 is linked biochemically and genetically to the Hedgehog signaling pathway.
Fig. 4 Reconstruction of TGF-β signaling pathway (target instance). The nodes and edges in green denote the signaling components and signaling PPIs of the experimental TGF-β signaling pathway. The nodes and edges in red denote the predicted signaling components and the derived signaling PPIs
Fig. 5 Reconstruction of TNF-α signaling pathway (target instance). The nodes and edges in green denote the signaling components and signaling PPIs of the experimental TNF-α signaling pathway. The nodes and edges in red denote the predicted signaling components and the derived signaling PPIs
Fig. 6 Cross-talks between TGF-β signaling pathway and TNF-α signaling pathway (target instance). The nodes and edges in green denote the predicted signaling components and derived signaling PPIs of TGF-β signaling pathway. The nodes and edges in blue denote the predicted signaling components and derived signaling PPIs of TNF-α signaling pathway. The nodes and edges in red denote the common signaling components and the common signaling PPIs
The evidence supporting the proteome-wide predictions is very limited, so we resort to the KEGG database [19] for further validation. Although the data in KEGG are not newly published or updated, the data that are collected in KEGG but not in NetPath are still suitable as validation data. At present, the overlap rate of signaling components between NetPath and KEGG is very low. For instance, the overlap rate of the TGF-β signaling pathway between the two databases is 22.62 % and that of the TNF-α signaling pathway is only 13.77 %. Here more than sixty predicted signaling components are validated against the KEGG database (see Table 3), suggesting that the proteome-wide predictions yielded by the proposed method are reliable. From Table 3, we can see that the homolog instances recognize more novel signaling components than the target instances, once again demonstrating that the homolog instances also yield valuable predictions.
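The KEGG-based validation above can be illustrated with a minimal sketch. The text does not define the overlap rate, so the sketch assumes a Jaccard-style ratio (shared components divided by all distinct components). The function names `overlap_rate` and `validated_predictions` and the placeholder protein identifiers are hypothetical and only serve the illustration.

```python
def overlap_rate(a, b):
    """Assumed definition: shared members divided by all distinct members."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def validated_predictions(predicted, netpath, kegg):
    """Predicted components listed in KEGG but absent from NetPath."""
    held_out = set(kegg) - set(netpath)  # KEGG-only members act as validation data
    return sorted(set(predicted) & held_out)


# Placeholder memberships for one pathway (not real annotations).
netpath = {"P1", "P2", "P3", "P4"}
kegg = {"P1", "P3", "P5", "P6", "P7"}
predicted = {"P5", "P7", "P8"}

rate = overlap_rate(netpath, kegg)
hits = validated_predictions(predicted, netpath, kegg)
print(f"overlap rate: {rate:.2%}")
print("validated against KEGG:", hits)
```

Under a different definition (for example, shared components divided by the size of one database), only `overlap_rate` would change; the held-out validation step stays the same.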

Discussion
Signaling pathways play significant roles in the biological processes of cell growth, cell differentiation, cell apoptosis and organism development. At present, the known signaling pathways are far from complete. Computational modeling helps to accelerate the proteome-wide reconstruction and global cross-talk mapping of human signaling pathways. The existing computational methods focus on predicting signaling components and/or deriving orthologous signaling PPIs from the topology of signal transduction networks, or on describing the molecular dynamics of signaling pathways. To our knowledge, no computational methods have been reported to simultaneously take more than two signaling pathways into account and explicitly predict their cross-talks. In this work, we propose a multi-label multi-instance transfer learning method to simultaneously reconstruct 27 human signaling pathways and model their cross-talks. The known signaling components of the 27 human signaling pathways are directly exploited to train a 28-class predictive model (the 28th class is the negative class), and the model is used to predict novel signaling components proteome-wide. Then the predicted signaling components are linked to the current signaling pathways via the experimental PPIs in the HPRD database. Based on the predicted signaling components and the derived signaling PPIs, we can conveniently reconstruct the 27 human signaling pathways and derive their cross-talks. Computational results show that both the target instances and the homolog instances achieve satisfactory multi-label learning performance and the homolog instances also

Next we study the molecular functions that the predicted signaling components fulfil. As for the TGF-β signaling pathway, the GO enrichments of the terms GO:0005515 (protein binding), GO:0046872 (metal ion binding) and GO:0004842 (ubiquitin-protein ligase activity) are 38.01, 27.74 and 11.99 %, respectively. As for the TNF-α signaling pathway, the GO enrichments of GO:0005524 (ATP binding), GO:0005515 (protein binding) and GO:0046872 (metal ion binding) are 23.58, 19.30 and 11.79 %, respectively. As for the cellular compartments in which the predicted signaling components reside, a majority of the predicted TGF-β and TNF-α signaling components are located in the cytoplasm (GO:0005737), nucleus (GO:0005634) and cytosol (GO:0005829). The GO enrichment analysis of the predicted TGF-β and TNF-α signaling components is illustrated in Fig. 7, where only the top 10 GO enrichments are given for each aspect of gene ontology. The full GO enrichment analysis of the predicted signaling components is given in Additional file 15 (biological processes), Additional file 16 (molecular functions) and Additional file 17 (cellular compartments).
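The percentages reported above can be read as the share of predicted signaling components annotated with a given GO term. The following is a minimal sketch under that assumption, which is not stated explicitly in the text; the function name `go_term_percentages` and the placeholder protein-to-term mapping are hypothetical.

```python
from collections import Counter


def go_term_percentages(annotations, top_k=10):
    """Return the top_k (term, percent of proteins annotated) pairs."""
    n = len(annotations)
    if n == 0:
        return []
    # Count each term at most once per protein.
    counts = Counter(term for terms in annotations.values() for term in set(terms))
    # Sort by descending count, breaking ties by term identifier.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, 100.0 * count / n) for term, count in ranked[:top_k]]


# Placeholder annotations for four predicted components (not real data).
annotations = {
    "P1": ["GO:0005515", "GO:0046872"],
    "P2": ["GO:0005515"],
    "P3": ["GO:0005515", "GO:0004842"],
    "P4": ["GO:0046872"],
}

top = go_term_percentages(annotations, top_k=2)
for term, percent in top:
    print(f"{term}: {percent:.2f} %")
```

Running the same function once per GO aspect (biological process, molecular function, cellular compartment) with `top_k=10` would mirror the layout described for Fig. 7.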

Comparison with the existing methods
The existing computational methods for reconstruction of signaling pathways are largely classified into two categories: graph search methods [9][10][11] and machine learning methods [14][15][16]. Graph search methods rely on PPI network topology to search for signaling pathways. These methods are simple and impose the fewest data constraints, but feedback loops make the shortest path algorithm inaccurate. The existing machine learning methods focus on the discovery of novel signaling components. These methods exploit the experimental data of signaling components and mainly predict orthologous signaling pathways, but they seldom simultaneously exploit more than two signaling pathways and model their cross-talks. The proposed multi-label multi-instance method simultaneously exploits 27 human cancer signaling pathways to model the phenomenon that a signaling protein belongs to more than one signaling pathway. Compared with the existing methods, our method has the merit of explicit knowledge sharing and knowledge transfer between signaling pathways. After linking the predicted signaling components to signaling pathways, we can easily derive the cross-talk signaling components and cross-talk signaling PPIs.
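The final step, deriving cross-talk components and cross-talk PPIs, can be sketched as simple set operations. The sketch assumes that a pathway's signaling PPIs are the experimental PPIs whose two partners are both components (known or predicted) of that pathway, and that cross-talk is the intersection of two pathways' components and PPIs. The function names and the placeholder identifiers are hypothetical, not the authors' implementation.

```python
def pathway_ppis(components, ppis):
    """Undirected PPIs whose two distinct partners are both pathway components."""
    return {
        frozenset(pair)
        for pair in ppis
        if len(set(pair)) == 2 and set(pair) <= components  # skip self-interactions
    }


def cross_talk(pathway_a, pathway_b, ppis):
    """Common components and common PPIs of two reconstructed pathways."""
    common_components = pathway_a & pathway_b
    common_ppis = pathway_ppis(pathway_a, ppis) & pathway_ppis(pathway_b, ppis)
    return common_components, common_ppis


# Placeholder PPI list and component sets (known plus predicted); not real data.
ppis = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
pathway_1 = {"A", "B", "C"}
pathway_2 = {"B", "C", "D"}

shared_nodes, shared_edges = cross_talk(pathway_1, pathway_2, ppis)
print("common components:", sorted(shared_nodes))
print("common PPIs:", sorted(sorted(edge) for edge in shared_edges))
```

Storing each PPI as a `frozenset` makes the interaction undirected, so ("B", "C") and ("C", "B") count as the same edge when the two pathways are intersected.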

Applicability
The method can be extended to solve other biological problems. The computational results provided in the supplementary files can be used as a benchmark for novel method development or for further biomedical research.

Conclusion
In this work, we propose a multi-label multi-instance method to simultaneously reconstruct 27 human cancer signaling pathways and model their cross-talks. The proposed method demonstrates satisfactory multi-label learning performance, and some of the proteome-wide predictions are validated against the signaling pathway databases (KEGG, Reactome and SignaLink) and recent literature. The method and the results can be used for further model development and biomedical research.