Knowledge transfer via classification rules using functional mapping for integrative modeling of gene expression data

Background: Most 'transcriptomic' data from microarrays are generated from small sample sizes compared to the large number of measured biomarkers, making it very difficult to build accurate and generalizable disease state classification models. Integrating information from different, but related, 'transcriptomic' data may help build better classification models. However, most proposed methods for integrative analysis of 'transcriptomic' data cannot incorporate domain knowledge, which can improve model performance. To this end, we have developed a methodology that leverages transfer rule learning and functional modules, which we call TRL-FM, to capture and abstract domain knowledge in the form of classification rules to facilitate integrative modeling of multiple gene expression data. TRL-FM is an extension of the transfer rule learner (TRL) that we developed previously. The goal of this study was to test our hypothesis that "an integrative model obtained via the TRL-FM approach outperforms traditional models based on single gene expression data sources".

Results: To evaluate the feasibility of the TRL-FM framework, we compared the area under the ROC curve (AUC) of models developed with TRL-FM and other traditional methods, using 21 microarray datasets generated from three studies on brain cancer, prostate cancer, and lung disease, respectively. The results show that TRL-FM statistically significantly outperforms TRL as well as traditional models based on single source data. In addition, TRL-FM performed better than other integrative models driven by meta-analysis and cross-platform data merging.

Conclusions: The capability of utilizing transferred abstract knowledge derived from source data using feature mapping enables the TRL-FM framework to mimic the human process of learning and adaptation when performing related tasks. The novel TRL-FM methodology for integrative modeling of multiple 'transcriptomic' datasets is able to intelligently incorporate domain knowledge that traditional methods might disregard, to boost predictive power and generalization performance. In this study, TRL-FM's abstraction of knowledge is achieved in the form of functional modules, but the overall framework is generalizable in that different approaches of acquiring abstract knowledge can be integrated into this framework.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0643-8) contains supplementary material, which is available to authorized users.


Background
With the advent of high-throughput 'transcriptomic' technology, biomarkers measured in tissue or bodily fluids have generated a vast amount of data, from which classification models can be and have been developed to predict the early development, diagnosis, and prognosis of diseases [1]. A major challenge for class prediction tasks is that most types of 'transcriptomic' data, like gene expression data, are characterized by a small sample size (tens to hundreds) and a large number of variables (ranging from hundreds to several thousand). Classification models learned from such high-dimensional data might not generalize well and may lack strong statistical support. In addition, the heterogeneity of sample sources and experimental protocols can make it very difficult to discover robust biomarkers that predict a disease state with high fidelity.
To address these challenges, combining multiple independent studies that were designed to investigate the same biological problem has been proposed as a way to improve classification performance in diagnostic and prognostic models [1-4]. Two of the most common strategies for combining 'transcriptomic' data for integrative modeling are meta-analysis and cross-platform data merging [5]. In the former approach, integration occurs at the interpretive level, where results (e.g., classification accuracies, p-values, ranks, etc.) from individual studies are combined, while in the latter, integration occurs by rescaling expression values into numerically comparable measures before the class prediction task.
A major limitation of these approaches is that they can neither incorporate prior domain knowledge nor transfer latent biological information, either of which might help boost predictive performance. Studies by Ptitsyn and colleagues [6] revealed that the state (e.g., level of perturbation) of pathways such as cell adhesion, energy metabolism, antigen presentation, and cell cycle regulation could predict metastasis progression in colorectal and breast cancer samples. Meanwhile, Huang et al. [7] suggested that pathway-based prognosis models for breast cancer perform better than gene-based ones. Thus, incorporating or transferring prior biological knowledge, such as the state of a pathway or functional associations of genes, into model generation could improve predictive performance on 'transcriptomic' datasets.
Ganchev and colleagues proposed a novel framework, transfer rule learning (TRL), which leverages the concept of transfer learning to build an integrative model of classification rules from two datasets [8]. Transfer learning (TL) is the use of information learned from one task, which we call the source task, to learn a different, albeit related, task, which we call the target task [9]. Given two datasets, where one is designated as the source and the other as the target, TRL builds classification rules in two main steps. First, it learns a rule model on the source, and second, it transfers knowledge learned from the source model to seed learning of a new rule model on the target. TRL is a useful tool for integrative modeling of multiple microarray gene expression (MAGE) studies. Given two or more datasets, TRL can carry out integrative modeling in a pairwise fashion.
The TRL framework, however, has limited capabilities; in particular, its strategy for knowledge transfer could be improved. Generally, humans are able to recognize and apply knowledge learned from a previous task to a new task if they can align the commonalities between the two [9,10]. For instance, skills learned from a programming language like C++ could be applied to learn a new language, like Java. Both adhere to common programming principles (e.g., both implement a "for loop") even though the syntax can be different. Therefore, for transfer learning to be meaningful, it is essential to capture the commonalities that the source and target share. TRL's mechanism for establishing this commonality is to identify common variables between the source and target datasets. However, studies have shown that different classification models built on independent microarray datasets can contain different sets of biomarkers with little overlap. In addition, models based on different variable sets can yield similar classification performance when tested on the same validation dataset [1,11,12]. This means that relying solely on identical variables to establish commonality might not be enough, and therefore exploring and incorporating other means of determining variable equivalence could be vital for model performance.
Several genes, though represented by different symbols, can have something in common. For instance, they might belong to the same biological pathway or be associated with the same disease. In humans, for example, the TP53 gene, which encodes the tumor protein p53, is known to play a key role in the activation and/or control of apoptosis [13]. Meanwhile, caspase-6, an effector caspase encoded by the CASP6 gene, cleaves other proteins to trigger the apoptotic process [13]. Superficially, TP53 and CASP6 are different, but both play a prominent role in apoptosis. TRL and several meta-analysis methods cannot capture this functional similarity, or many others, for integrative analysis.
In this paper we present TRL-FM, an extension of the TRL framework that can capture and incorporate abstract knowledge to improve integrative modeling of MAGE datasets. TRL-FM leverages functional modules to capture and abstract underlying commonalities, such as functional similarities, among variables across MAGE datasets. To the best of our knowledge, this is the first paper proposing the application of functional modules via knowledge transfer for integrative rule modeling of multiple gene expression datasets.
A functional module (FM) consists of a group of cellular components and their interactions that can be associated with a specific biological process. An FM can be a discrete functional entity separable from other FMs or an amalgam of various FMs with a single functional theme [14]. TRL-FM posits that biomarkers that co-occur in the same FM can possess similar predictive value, so that they can serve as proxies for each other during knowledge transfer from one dataset to another. On this basis, TRL-FM should be able to recognize functionally similar, but non-identical, variables (e.g., TP53 and CASP6, as illustrated above) to facilitate knowledge transfer.
Our goal in this study was threefold. First, to test whether FMs can capture the underlying commonality among variables of different but related gene expression datasets, and whether they are more effective as bridges for knowledge transfer than reliance on identical variables. Second, to test the hypothesis that integrative modeling via the TRL-FM approach outperforms traditional models based on single gene expression data sources. Last, to evaluate and compare the classification performance of TRL-FM with traditional methods, using 21 gene expression datasets that were collected from three studies: one on brain cancer, one on prostate cancer, and one on a lung disease (idiopathic pulmonary fibrosis or IPF). Figure 1 depicts an overview of the TRL-FM framework. For the sake of simplicity, this framework performs transfer between two different, but related, sets of microarray data: a source and a target. However, TRL-FM, as we will show later in this article, can glean information from several sources to facilitate knowledge transfer, a strategy that is akin to receiving advice from several experts.

Methods
The key steps of the framework are as follows. First, select relevant variables from the source(s) using a feature selection method. Second, identify FMs among the selected variables. Third, using the discovered FMs, along with rules induced from the source dataset(s), build a prior hypothesis of classification rules. Finally, using the prior hypothesis as a seed, learn a new classification rule model from the target dataset. TRL-FM comprises four major components that execute these steps, namely feature selection via discretization, identification of functional modules, classification rule learning, and transfer learning of classification rules via functional mapping. We briefly describe these components below.
Fig. 1 The TRL-FM framework. The framework for knowledge transfer using functional mapping and classification rules works as follows. First, use a feature selector to select relevant variables from the source and target datasets. Second, combine the selected variables into a single list and partition them into functional modules (FMs). Third, using the discovered functional modules in addition to rules induced from the source data, build a prior hypothesis of classification rules. Finally, using the prior hypothesis as a seed, learn a new classification rule model on the target data.

Feature selection via discretization

MAGE datasets comprise hundreds or thousands of measured variables. The goal in integrative modeling is to identify and select, from among these hundreds or thousands, a handful of relevant variables that can accurately predict a disease state or estimate the risk of disease in an individual. The selected variables serve as building blocks for constructing classification models. Moreover, the variables in MAGE data are in most cases continuous, meaning that they can take an infinite number of possible values within a specified range. Continuous data pose several challenges to knowledge discovery and data mining tasks; they make it more difficult to create compact, interpretable, and accurate classification models [15]. Several machine learning algorithms that are used for learning classification models, such as decision trees [16] and rule learners [17,18], handle discretized data much better [19]. It has also been shown that discretization, the process of converting a continuous variable to a discrete one, can improve the accuracy of some classifiers [20]. Integrated into the TRL-FM algorithm is a discretization method that converts continuous MAGE variables into discrete ones. After discretization, variables that have a single interval can be filtered out since they cannot discriminate the target class. With this filtration strategy, discretization also serves as a feature selection method.
We applied Efficient Bayesian Discretization (EBD) [19], a supervised discretization method, to discretize the input data. EBD uses a Bayesian score to find an optimal discretization of continuous variables in high-dimensional biomedical datasets, and it has statistically significantly better performance than other commonly used discretization methods [19] (see Additional file 1 for an overview of the algorithm).
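To make this filtration strategy concrete, below is a minimal sketch (assuming numpy), not the EBD algorithm itself: it uses a simple single-cut, information-gain criterion as a stand-in for EBD's Bayesian score and drops every variable for which no informative cut point exists (i.e., a single-interval variable).

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_binary_cut(x, y, min_gain=1e-3):
    """Return the best single cut point for continuous variable x, or None
    if no cut yields an information gain above min_gain (single interval)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    base = entropy(y)
    best_gain, best_cut = 0.0, None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                      # only cut between distinct values
        left, right = y[:i], y[i:]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if gain > best_gain:
            best_gain, best_cut = gain, (x[i - 1] + x[i]) / 2.0
    return best_cut if best_gain >= min_gain else None

def discretize_and_filter(X, y, gene_names):
    """Discretize each variable; keep only multi-interval (informative) ones."""
    selected = {}
    for j, name in enumerate(gene_names):
        cut = best_binary_cut(X[:, j], y)
        if cut is not None:               # single-interval variables are filtered out
            selected[name] = cut
    return selected

# Toy usage with synthetic expression values
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = (X[:, 0] > 0).astype(int)             # only the first "gene" is informative
print(discretize_and_filter(X, y, [f"gene{j}" for j in range(5)]))
```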

Discovering Functional Modules
Given a list of arbitrary genes, several methods can be used to identify underlying biological commonalities, which can subsequently be abstracted into domain knowledge in the form of functional modules. Biological commonality here can mean association with a common disease, function, pathway, etc. Gene set enrichment analysis (GSEA), for instance, is a popular method for identifying functional sets of genes associated with particular conditions of interest from 'transcriptomic' data. There is a plethora of GSEA methods, each with its inherent strengths and limitations [21,22]. The focus of this paper is to highlight the utility of incorporating abstracted background knowledge to improve classification modeling, not necessarily to evaluate which knowledge abstraction method improves performance the most. To this end, we implemented a Gene Ontology (GO) similarity-based method to identify commonalities among variables in MAGE datasets. Figure 2 illustrates the steps for discovering functional modules among an arbitrary list of genes (see Additional file 1 for additional details).

The protocol
First, we mapped each gene in the input set to the corresponding GO term(s) that annotate(s) the gene, according to the GO annotation database [23]. For example, if G denotes the set of input genes, then we map each gene g (where g ∈ G) to the GO term go (where go ∈ GO) that annotates it. Here, GO refers to the set of biological process terms in the Gene Ontology. For example, the mapping M(g1) = {go1, go3} means that terms go1 and go3 annotate gene g1. Subsequently, we formed the union of all GO terms that annotate at least one member of the input gene set. This set of GO terms served as input to the clustering phase.
Second, using semantic similarity [24] as a distance measure, we constructed a similarity matrix among the GO terms. With the similarity matrix as input, we applied the spectral clustering algorithm [25] to group the GO terms into functionally similar clusters. Subsequently, we applied the Silhouette value technique [26] to estimate the appropriate number of clusters as well as cluster validity.

Finally, we mapped each gene gi (i.e., the keys of map M) to cluster Ci if there existed at least one term in Ci that annotates gi. This enabled us to identify groups of genes that perform the same or similar functions, as well as genes that perform multiple functions. Any group of genes that maps to a particular GO cluster (e.g., {g1, g2, g3} → C1) forms a functional module.
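A minimal sketch of this functional-module discovery protocol is shown below. The semantic_similarity function is a crude placeholder for a real GO semantic-similarity measure (e.g., Resnik or Lin [24]), the GO identifiers in the toy example are illustrative, and the use of scikit-learn for spectral clustering and Silhouette scoring is an assumption about tooling rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

def semantic_similarity(go_a, go_b):
    """Placeholder for a GO semantic-similarity measure (e.g., Resnik/Lin).
    Returns a value in [0, 1]; 1 means the terms are identical."""
    return 1.0 if go_a == go_b else (0.5 if go_a[:7] == go_b[:7] else 0.1)

def cluster_go_terms(go_terms, max_k=6):
    """Cluster GO terms by semantic similarity; pick k by mean Silhouette value."""
    n = len(go_terms)
    sim = np.array([[semantic_similarity(a, b) for b in go_terms] for a in go_terms])
    dist = 1.0 - sim                                   # silhouette_score needs distances
    best_labels, best_score = None, -1.0
    for k in range(2, min(max_k, n - 1) + 1):
        labels = SpectralClustering(n_clusters=k, affinity="precomputed",
                                    random_state=0).fit_predict(sim)
        score = silhouette_score(dist, labels, metric="precomputed")
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels, best_score

def build_functional_modules(gene_to_go, go_clusters):
    """Map each gene to every cluster containing at least one of its GO terms."""
    return {cluster_id: sorted(g for g, gos in gene_to_go.items() if terms & set(gos))
            for cluster_id, terms in go_clusters.items()}

# Toy usage: four genes annotated with hypothetical GO ids
gene_to_go = {"TP53": ["GO:0006915", "GO:0006974"],
              "CASP6": ["GO:0006915"],
              "CBS": ["GO:0019343"],
              "ADCY3": ["GO:0006171"]}
go_terms = sorted({t for ts in gene_to_go.values() for t in ts})
labels, score = cluster_go_terms(go_terms)
clusters = {c: {t for t, l in zip(go_terms, labels) if l == c} for c in set(labels)}
print(build_functional_modules(gene_to_go, clusters), "silhouette:", round(score, 2))
```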

Classification rule learning with RL
The TRL-FM framework is driven by the rule learner (RL) [18], a classification rule learning algorithm, which has been used successfully in several classification tasks involving genomic and proteomic studies [27][28][29][30].
Given a set of training examples, each a vector of variable-value pairs together with a class label, RL learns a set of IF-THEN propositional rules. RL induces rules of the form:

IF Condition THEN Consequent

where the Condition consists of one or more variable tests, which we also call conjuncts, and the Consequent denotes a prediction of the target variable, also known as the class variable. Every induced rule has classification-relevant statistics associated with it. For example, consider the hypothetical rule below:

IF ((gene1 > 1680) AND (gene2 ≤ 28.6)) THEN (Class = Case)

where gene1 and gene2 are biomarkers with two intervals of values after discretization. We interpret the rule as follows: "when gene1 is up-regulated (i.e., > 1680) and gene2 is down-regulated (i.e., ≤ 28.6), then predict the target class as Case." In this example, the statistics associated with the rule indicate that RL induced it with a 98 % degree of confidence, which we call the Certainty Factor (CF). Several rule evaluation functions, such as precision or the Laplace estimate, are used by RL to calculate CF. P represents the p-value, computed by Fisher's exact test. We define other relevant statistics as follows: True Positives (TP) are the number of positive examples that are correctly predicted as positive, while False Positives (FP) are the number of negative examples that are incorrectly predicted as positive. The TP and FP values in the example above mean that, out of 60 data instances that the rule antecedent (Condition) matched, 56 were predicted correctly.
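For illustration, the snippet below computes the two rule-evaluation functions mentioned above, precision and the Laplace estimate, from TP and FP counts; it sketches the general formulas rather than RL's exact internal scoring.

```python
def precision(tp, fp):
    """Positive predictive value of a rule: TP / (TP + FP)."""
    return tp / (tp + fp)

def laplace_estimate(tp, fp, n_classes=2):
    """Laplace-corrected confidence: (TP + 1) / (TP + FP + n_classes)."""
    return (tp + 1) / (tp + fp + n_classes)

# For the hypothetical rule above, 56 of the 60 matched instances were correct:
tp, fp = 56, 4
print(f"precision = {precision(tp, fp):.3f}, Laplace = {laplace_estimate(tp, fp):.3f}")
```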
RL has several characteristics that make it particularly suitable for use in biomarker discovery studies [31]. First, unlike other knowledge discovery algorithms such as artificial neural networks or support vector machines, humans can easily interpret classification models learned by RL. Second, RL is simple and flexible, such that users can leverage domain knowledge to set learning parameters a priori in order to improve the search of the hypothesis space. Third, RL covers rules with replacement. That is, it neither recursively partitions the instance space of the training examples (as, e.g., C4.5 [16] does) nor eliminates training instances covered by a rule as learning proceeds (as, e.g., CN2 [17] does); instead, it allows rules to cover overlapping regions in the instance space. Covering training instances with replacement particularly suits situations where data are scarce (e.g., microarray data), since ample data remain available to provide statistical support for newly induced rules. Fourth, RL can handle nonlinear relationships as well as hierarchical variables, such as cancer and its subtypes. Fifth, to avoid costly errors, RL can abstain (i.e., remain agnostic) from predicting a test case when it has low confidence in the accuracy of the rule [27,31].
Fig. 2 A protocol for identifying functional modules using spectral clustering and the Gene Ontology. Given an input set of genes, first map each gene to the corresponding GO term(s) that annotate(s) it according to the GO annotation database [23]; that is, if G denotes the set of input genes, map each gene g (where g ∈ G) to the GO term go (where go ∈ GO) that annotates it. Here, GO refers to the set of biological process terms in the Gene Ontology; for example, the mapping M(g1) = {go1, go3} means that terms go1 and go3 annotate gene g1. Second, form the union of all GO terms that annotate at least one member of the input gene set. Third, using semantic similarity [24] as a distance measure, construct a similarity matrix among the GO terms. Fourth, with the similarity matrix as input, apply the spectral clustering algorithm [25] to group the GO terms into functionally similar clusters. Fifth, apply the Silhouette value technique [26] to estimate the appropriate number of clusters as well as cluster validity. Finally, map each gene gi (i.e., the keys of map M) to cluster Ci if there exists at least one term in Ci that annotates gi.

Internally, RL stores induced rules in a priority queue (aka the beam) by sorting them according to their CF and coverage. By default, we set the beam width (i.e., the total number of rules in memory) to 1000. To construct a rule model (a set of disjunctive rules), the RL algorithm proceeds as a heuristic beam search through the space of rules, using a general-to-specific approach [18]. First, it considers every variable as a potential predictor of the target class variable. For each discretized interval value of a marker, it creates as many rules as there are target class values. For example, for a Case/Control binary class, it will create two rules for each discretized marker value: one rule predicts Case and the other predicts Control. Second, it places an induced rule on the beam if it satisfies user-specified constraints, also known as good-rule criteria. The criteria are minimum CF value, minimum coverage, maximum false positive rate, and maximum conjuncts (i.e., the maximum number of variable-value pairs allowed in a rule antecedent). Subsequently, each rule on the beam is specialized if it satisfies the constraints. Specialization is the process whereby the rule learner successively adds conjuncts (i.e., marker-value pairs) to the rule antecedent until the constraints are violated. The algorithm stops and outputs the set of rules on the beam when there are no more rules to specialize. The set of classification rules output by RL is referred to as a rule model. The RL algorithm has been integrated as a subroutine in the TRL-FM algorithm (see Fig. 3).
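The following is a compact, hedged sketch of the general-to-specific beam search just described. It uses a simplified representation (a rule is a dict of variable→value conjuncts plus a predicted class) and precision as the CF function; all names, parameters, and the toy data are illustrative rather than RL's actual interface.

```python
from itertools import product

def rule_stats(conjuncts, data, target_class):
    """Count TP/FP among the instances matched by the rule antecedent."""
    covered = [row for row in data
               if all(row[v] == val for v, val in conjuncts.items())]
    tp = sum(1 for row in covered if row["class"] == target_class)
    return tp, len(covered) - tp

def beam_search_rl(data, variables, values, classes, beam_width=1000,
                   min_cf=0.8, min_cov=3, max_conjuncts=3):
    """Heuristic general-to-specific beam search over conjunctive rules."""
    def cf(tp, fp):                        # precision as a simple CF function
        return tp / (tp + fp) if tp + fp else 0.0

    seen, beam = set(), []

    def consider(conj, cls):
        """Score a candidate rule; keep it if it satisfies the good-rule criteria."""
        key = (frozenset(conj.items()), cls)
        if key in seen:
            return False                   # skip redundant specializations
        seen.add(key)
        tp, fp = rule_stats(conj, data, cls)
        if tp + fp >= min_cov and cf(tp, fp) >= min_cf:
            beam.append((conj, cls, tp, fp))
            return True
        return False

    # Step 1: one single-conjunct rule per discretized value per class value.
    for var, val, cls in product(variables, values, classes):
        consider({var: val}, cls)

    # Step 2: specialize rules on the beam by adding one conjunct at a time,
    # pruning the beam to beam_width after each pass.
    changed = True
    while changed:
        changed = False
        for conj, cls, _, _ in list(beam):
            if len(conj) >= max_conjuncts:
                continue
            for var, val in product(variables, values):
                if var not in conj and consider(dict(conj, **{var: val}), cls):
                    changed = True
        beam.sort(key=lambda r: (cf(r[2], r[3]), r[2] + r[3]), reverse=True)
        del beam[beam_width:]
    return beam

# Toy usage: two discretized genes, binary class
data = ([{"gene1": "HIGH", "gene2": "LOW", "class": "Case"}] * 5 +
        [{"gene1": "LOW", "gene2": "HIGH", "class": "Control"}] * 5)
for conj, cls, tp, fp in beam_search_rl(data, ["gene1", "gene2"],
                                        ["LOW", "HIGH"], ["Case", "Control"]):
    print(conj, "=>", cls, f"(TP={tp}, FP={fp})")
```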

Transfer learning of classification rules via functional mapping
The TRL-FM algorithm, illustrated in Fig. 3, implements the TRL-FM framework (see Fig. 1). In this version of the framework, transfer via functional mapping occurs between a single source and a target dataset. The framework can be modified to include multiple source datasets, or a list of biomarkers in place of the source. The latter is particularly useful when a source dataset is not readily available but markers, which can be mined from the literature or gleaned from domain knowledge, are obtainable. Furthermore, this method of providing source information injects flexibility into the prior-rule generation phase, as it does not present the challenges, well elucidated by Ganchev et al. [31], that arise when mapping variable values, or discretized intervals, across the source and target datasets.
The algorithm accepts as inputs the source and target datasets, along with user-specified constraints (i.e., minimum CF, minimum coverage, inductive strengthening, and maximum conjuncts) for RL. EBD discretizes the input datasets if they contain continuous variables. Next, FMs are discovered among the variables selected by EBD to facilitate the transfer of knowledge for learning a model on the target dataset. Knowledge transfer occurs via the formulation of a prior hypothesis (i.e., a set of rules), which is used to seed learning of the target model.
Fig. 3 An algorithm for implementing the TRL-FM framework. This algorithm, an extensive modification of the TRL algorithm [8], incorporates a subroutine (see Fig. 4) for mapping functionally related variables between the source and target data. The statements in red font are additions to the TRL algorithm.

The source dataset is first analyzed using RL. The rule model learned on the source, in combination with the FMs, is used by the GeneratePriorRules function (see Fig. 4) to formulate a prior hypothesis, which is used to seed learning of a new rule model on the target. Using the FMs as a bridge, the function instantiates prior rules as follows:

1. For a particular functional module, FMk, select a variable, aTj, for rule instantiation if the selection condition given in the GeneratePriorRules function (Fig. 4) holds.
2. Build a prior rule "structure" for each selected variable.
3. For every selected variable, instantiate the prior rule structure with all discrete ranges of values and all class values. For example, if the discretized ranges of values for a marker aTj are LOW and HIGH and the target class values are Case and Control, then one prior rule is instantiated for each combination of range and class value.

Subsequently, the instantiated rules are loaded onto the beam and learning proceeds as a heuristic beam search in the typical RL fashion, as described above. In the specialization step (see Fig. 3), through the specialize() function, all non-redundant patterns obtained by adding a single variable-value pair to a rule's antecedent are considered. Note that all induced rules, including the prior rules, that do not satisfy the "good rule" criteria are pruned away (i.e., discarded); a rule is "good" if it satisfies the user-specified constraints.
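Below is a minimal sketch of the prior-rule instantiation step. The exact selection condition from Fig. 4 is not reproduced in the text, so the sketch assumes a target variable qualifies whenever it shares a functional module with at least one variable used in the source model; the function name, data structures, and toy values are illustrative only.

```python
from itertools import product

def generate_prior_rules(source_model_vars, functional_modules,
                         target_vars, value_ranges, class_values):
    """Instantiate single-conjunct prior rules on the target via functional mapping.

    source_model_vars:   variables appearing in the rule model learned on the source
    functional_modules:  dict FM_k -> set of genes belonging to that module
    target_vars:         variables available (after discretization) in the target data
    value_ranges:        dict target variable -> its discretized ranges (e.g., LOW/HIGH)
    class_values:        possible values of the target class (e.g., Case/Control)
    """
    prior_rules = []
    for fm_id, members in functional_modules.items():
        # Assumed selection condition: the FM links a source-model variable
        # to a variable that is measured in the target dataset.
        if not members & set(source_model_vars):
            continue
        for var in members & set(target_vars):
            for val, cls in product(value_ranges[var], class_values):
                prior_rules.append(({var: val}, cls, fm_id))
    return prior_rules

# Toy usage mirroring the Larsson -> KangA example discussed later in the text
fms = {"FM_cell_signaling": {"PKIG", "ASPN"}}
rules = generate_prior_rules(source_model_vars={"MPP6", "PKIG"},
                             functional_modules=fms,
                             target_vars={"ASPN", "FMO5"},
                             value_ranges={"ASPN": ["LOW", "HIGH"]},
                             class_values=["Case", "Control"])
for conj, cls, fm in rules:
    print(f"IF {conj} THEN Class = {cls}   (seeded via {fm})")
```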

Experiments
To test the feasibility of TRL-FM as a viable tool for integrative modeling of MAGE datasets, we applied the framework to learn classification rule models using publicly available datasets. The goals of the experiments were threefold. First, to ascertain TRL-FM's ability and flexibility in capturing abstract biological knowledge from source datasets in order to facilitate transfer learning. Second, to evaluate the classification performance of models built by TRL-FM and how it compares with traditional methods built on single source datasets. Last, to compare the performance of integrative modeling via the TRL-FM approach with meta-analysis and cross-platform data merging methods. Table 1 provides details of the three example sets of MAGE datasets that we used for the experiments. Each example set contained 7 microarray studies of two-group comparisons (i.e., case vs. control). The datasets were collected from three studies: a brain cancer study, a prostate cancer study, and an IPF study. These datasets particularly suit the goals of our experiments and illustrate the utility of integrative modeling of MAGE datasets because (1) they are publicly available, (2) they have been used extensively in several integrative modeling studies, and (3) they were generated using diverse microarray platforms. Testing the flexibility of TRL-FM with datasets generated on diverse platforms is essential, since TRL and many meta-analysis methods require identical platforms and variables for integrative modeling. That is, TRL-FM avoids the critical and often challenging task of mapping features (e.g., gene names) across disparate platforms for integrative modeling.

Experimental Design
Our task was to build a classification rule model that can classify normal tissue versus diseased tissue from the same organ, e.g., distinguish normal prostate tissue from prostate cancer. We designed our experiments to evaluate whether knowledge learned from datasets of the same MAGE example set (i.e., the same organ or disease of origin, such as the IPF set) can be transferred to enhance learning of a classification model on a new dataset. For each example set of the experimental datasets, consider a set of n datasets, D = {D1, D2, …, Dn}, where Di represents the ith dataset. Within a set, each dataset Di in turn was set as the target, while the rest, D − {Di}, were designated as source data for knowledge transfer. Guided by the TRL-FM framework (see Fig. 1), classification rule models were generated from the source datasets, one model per dataset. With this approach, n TRL-FM experiments can be performed within a set, so in all we executed 21 (i.e., 3 × 7) experiments. This study design was necessary to test the notion that knowledge transfer from multiple sources will more likely improve learning on the target.
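The target/source pairing in this design can be sketched as follows; only a few of the dataset names (those mentioned in the text) are real, and the remaining identifiers are placeholders, as are the commented model-building steps.

```python
def target_source_splits(datasets):
    """For a set D = {D1, ..., Dn}, yield each Di as target with D - {Di} as sources."""
    for i, target in enumerate(datasets):
        yield target, datasets[:i] + datasets[i + 1:]

# Illustrative dataset identifiers (7 per disease set, as in Table 1); only a few
# names (Petalidis, KangA, Larsson, Lapointe, Nanni) come from the text, the rest
# are placeholders.
example_sets = {
    "brain":    ["Petalidis"] + [f"brain_{i}" for i in range(2, 8)],
    "IPF":      ["KangA", "Larsson"] + [f"ipf_{i}" for i in range(3, 8)],
    "prostate": ["Lapointe", "Nanni"] + [f"prostate_{i}" for i in range(3, 8)],
}

n_experiments = 0
for disease, datasets in example_sets.items():
    for target, sources in target_source_splits(datasets):
        n_experiments += 1    # here: learn source models, transfer, evaluate on target
print(n_experiments)          # 3 sets x 7 datasets = 21 TRL-FM experiments
```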

Evaluation
We used the area under the Receiver Operating Characteristic curve (AUC) [32] to evaluate the predictive efficacy of TRL-FM. For each experiment, we measured the mean AUC over 10-fold cross-validation. In addition, we estimated the performance of TRL-FM when each FM was used on its own as the bridge for transfer. The rationale for this strategy was to ascertain whether particular functional themes improve the baseline performance (or not). We could have experimented with different combinations of the FMs to determine which particular group(s) optimizes learning. However, for n FMs, such an approach would yield (n choose 1) + (n choose 2) + … + (n choose n) = 2^n − 1 models, which quickly becomes computationally intractable (although with the advent of high-performance computing, this process could be automated). For the sake of simplicity, we instead experimented with an ensemble of all FMs. In addition, we compared the performance of TRL-FM with TRL and RL (baseline). Note that the TRL framework is constrained to a single dataset as the source, while TRL-FM extracts knowledge from multiple sources via functional mapping. This means that, for evaluating the TRL experiments, for every ith dataset that was designated as the target, the rest of the datasets in the same study (e.g., IPF) in turn had to be set as the source.
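For reference, the evaluation metric can be computed as in the sketch below, which uses scikit-learn's stratified 10-fold cross-validation and reports the mean AUC with its standard error; the classifier and synthetic data are stand-ins only, since RL/TRL-FM is not assumed to be available as a Python package.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def mean_auc_10fold(X, y, make_model):
    """Mean AUC (and SEM) over 10-fold stratified cross-validation."""
    aucs = []
    for train, test in StratifiedKFold(n_splits=10, shuffle=True,
                                       random_state=0).split(X, y):
        model = make_model().fit(X[train], y[train])
        scores = model.predict_proba(X[test])[:, 1]
        aucs.append(roc_auc_score(y[test], scores))
    aucs = np.array(aucs)
    return aucs.mean(), aucs.std(ddof=1) / np.sqrt(len(aucs))

# Synthetic stand-in for a MAGE dataset (100 samples, 20 "genes")
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)
print(mean_auc_10fold(X, y, lambda: LogisticRegression(max_iter=1000)))
```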
Finally, we compared the performance of our methods with traditional algorithms for single source datasets, namely Support Vector Machines (SVM), Linear Discriminant Analysis (LDA), Random Forest (RF), C4.5, Naïve Bayes (NB), and Penalized Logistic Regression (PLR). Using the same metric (i.e., AUC over 10-fold cross-validation), we evaluated the classification performance of these methods on the raw datasets as well as on cross-study integration via meta-analysis and data merging. Several methods have been proposed for combining microarray data via meta-analysis and cross-platform data merging [5,33]; for the sake of brevity and for demonstration purposes, we adopted the adaptively weighted (AW) Fisher method [34] for meta-analysis, and we applied ComBat [35], which uses an empirical Bayes method to adjust for batch effects across multiple gene expression studies before merging.

Identification of functional modules
For each round of TRL-FM experiments, we identified a set of FMs from the list of relevant source variables. In all, 21 sets of FMs were generated, since each dataset, in turn, was set as the target in each round of experiments. To simplify the rest of this discussion, we randomly selected one FM table from each disease study and present them here (Tables 2, 3, and 4); we provide the rest as supplementary data (see Additional file 2). Table 2 shows the functional modules that facilitated rule transfer when we set Petalidis (brain cancer) as the target dataset. Similarly, Tables 3 and 4 show the functional modules when KangA (IPF) and Lapointe (prostate cancer), respectively, were set as targets. Observe that the number of functional modules is not necessarily the same for all target data: KangA, for example, had 14, while Lapointe contained 10. This means that the former and latter had 14 and 10 FMs, respectively, that were functionally homogeneous, that is, that had average Silhouette values of at least 0.5.
We made three observations from the functional modules. First, almost all of the functional themes were composed of more than one gene. Second, some genes were multi-functional; that is, they were associated with more than one functional theme. In Table 2, for instance, ADCY3 (adenylate cyclase 3) was associated with DNA repair, protein phosphorylation, transport, and the response to glucose stimuli. We made a similar observation in Table 3, where CBS (cystathionine beta-synthase) was associated with several metabolic processes, brain development, and the regulation of kinase activity. Lastly, most of the discovered functional themes, such as signal transduction, apoptotic processes, cell differentiation, and cell proliferation, among many others, are associated with the hallmarks of cancer [6,36].
The biological information revealed by these observations, obtained using the TRL-FM approach of capturing, abstracting, and formulating propositional rules for knowledge transfer, could be essential for algorithm and model development in integrative modeling of MAGE datasets. Normally, for symbolic data mining algorithms like RL, the interestingness criteria (i.e., how good a rule is) for a newly induced rule are evaluated by objective measures such as confidence (e.g., positive predictive value) and support (e.g., the probability that a pattern will occur). Other, subjective methods, which leverage background knowledge or expert opinion, have also been proposed to define explicit criteria for rule interestingness [37]. Prior knowledge, for instance when gleaned from functional modules, the literature, and/or a domain expert, could be incorporated into classification rule induction to contribute, subjectively, to the evaluation of how good an induced rule is. However, while the incorporation of FMs into the TRL-FM framework facilitated the mapping of variables across source(s) and target datasets, the biological knowledge contained in them did not explicitly contribute to rule confidence within the rule-induction engine of the framework. Rather, it affected the learning bias of the algorithm by seeding the search with prior information. That is, instead of learning from scratch, the algorithm starts learning from a point in the search space that is presumably closer to the target solution.

RL vs TRL vs TRL-FM
In Tables 5 and 6, we compare and contrast the performance of TRL-FM with that of its predecessors, RL and TRL, on three datasets (Petalidis, KangA, and Lapointe), one from each disease type. Results for the rest of the datasets are provided as supplementary data (see Additional file 3 and Additional file 4). Tables 5 and 6 show the performance of classification rule models learned with TRL-FM versus the baseline RL, and with TRL versus the baseline RL, respectively. The goal of transfer learning is to improve learning performance on the target task. Positive transfer occurs when the knowledge transferred from the source improves classification performance on the target, while negative transfer is a reduction of performance on the target after knowledge transfer. The AUCs in Tables 5 and 6 shown in bold font denote positive transfer, while those resulting from negative transfer are underlined. Generally, learning with TRL-FM yielded more positive transfers than TRL. In addition, transfer with the union of FMs usually produced positive transfer, whereas with TRL one has to experiment with all available sources to determine the best possible transfer. Thus, while integrative modeling with the TRL-FM framework will most likely lead to positive transfer, with the TRL framework it currently cannot be estimated a priori which particular source from the same set of data will lead to positive transfer. Furthermore, a Mann-Whitney paired-sample signed rank test with a significance level α = 5 % showed that transfer with TRL-FM improves on the baseline statistically significantly more than even the best TRL (see Table 8).
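In outline, the paired comparison above could be run as follows; SciPy's wilcoxon implements the paired-sample signed-rank test, and the 21 AUC pairs below are synthetic, illustrative values rather than results from Tables 5-8.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
baseline_auc = rng.uniform(0.6, 0.8, size=21)          # synthetic per-dataset AUCs (RL)
trlfm_auc = np.clip(baseline_auc + rng.normal(0.05, 0.03, size=21), 0, 1)  # TRL-FM

stat, p = wilcoxon(trlfm_auc, baseline_auc)             # paired signed-rank test
print(f"W = {stat:.1f}, p = {p:.4f}, significant at alpha = 0.05: {p < 0.05}")
```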
Table 5 Comparison of TRL-FM with baseline RL. AUCs when RL (baseline) and TRL-FM are applied to build a classification rule model on three datasets, Petalidis (brain), KangA (IPF), and Lapointe (prostate). For TRL-FM, the FMs are the medium through which knowledge transfer occurs. "Union" is an ensemble of all FMs. The mean and the standard error of the mean (SEM) for the AUC of a dataset were obtained by 10-fold cross-validation.

These results highlight the impact of FMs on the induction of a rule model. As we observed, no particular functional theme(s) consistently improved classification performance across all target datasets; that is, there was no direct correlation between individual functional themes and improved performance on a given target. The intuition here is that the more related two domains are, the better the learning performance of transfer learning. In addition, when snippets of information from the FMs are fused together, potential errors inherent in knowledge transfer via individual FMs can be alleviated. Meanwhile, results from other studies support our finding that a combination of FMs (e.g., a group of pathways), more often than not, improves performance in integrative analysis of genomic data [7,38]. Furthermore, since TRL-FM is able to capture and abstract underlying domain knowledge in the form of functional modules, it is able to go a step further and ask whether two or more nominally different biomarkers have any commonality among them. This capability makes TRL-FM more intelligent and effective for transfer learning than TRL. That is, TRL-FM can facilitate knowledge transfer among MAGE datasets that have different variable symbols, as long as the variables can be mapped to a common biological function(s). For example, in the transfer of classification rules from the Larsson to the KangA data (both from the IPF set), TRL is unable to transfer knowledge because the set of variables (MPP6, PKIG) used to build the source model does not overlap the set of variables (ASPN, FMO5, MMP11, IL13RA2) that the target model incorporates. TRL-FM, on the other hand, is able to transfer knowledge because of the association of PKIG (from the source model) and ASPN (from the target model) with cell signaling (Table 3). Another example is the transfer of classification rules from the Nanni dataset to the Lapointe dataset, both of the prostate set. As in the previous case, the set of variables (CCL1, MUC1, ATOX1, BCAM, BAT3) contained in the source model does not overlap with that (MYL6, ADCY2, GJA1, TCEAL4, PARG, MTMR7, SEC23A, ACTA2, COQ7, SNAI2, MAP3K14) incorporated in the target model. Nevertheless, TRL-FM was able to use functional mapping via FMs to instantiate prior rules for seeding learning on the target using ADCY2, GJA1, SNAI2, and MAP3K14, owing to their functional association with MUC1 in signal transduction and the regulation of transcription (Table 4). Using the TRL framework, which requires the recognition of identical variables across the source and target, this knowledge transfer could not have occurred.

Comparison with other methods
The results displayed in Tables 7 and 8 indicate that integrative modeling via the TRL-FM approach statistically significantly outperforms traditional models based on single source datasets. The advantage TRL-FM has over the traditional models is that it is able to pool information, via transfer learning and functional mapping, from other data sources to enhance model development.
Combining information from different source datasets, via biological knowledge bases, for model building may reduce inherent noise, which hampers predictive performance. Thus, for transcriptomic datasets, which are mostly characterized by small sample sizes and large variable sets, integrative modeling via the TRL-FM approach is a viable mechanism to boost predictive power and generalization performance. Table 9 (see Additional file 6 for further details) shows the performance of all non-transfer-learning-based classifiers on the datasets after integration with meta-analysis. The results indicate that there were no clear, significant improvements in performance as compared to transfer learning. What is more, in Table 10 we compare the average classification performance within each disease type (e.g., brain cancer) versus the performance when disease-specific datasets were merged into one data matrix via meta-analysis and batch-effect removal. Classification performance on the meta-analysis-inspired dataset was not significantly different from the average performance per disease type. However, we observed a significant reduction in performance when disease-specific datasets were merged via removal of systematic bias. This result is not too surprising, as a similar observation was made in a related study [3]. It is most likely that the method could not effectively handle the heterogeneity inherent across the different studies.
Overall, integrative modeling via the transfer rule learning and functional mapping approach performs better than methods inspired by meta-analysis and cross-platform data merging. Highlights from the results suggest that, in predictive model design, it might be better to focus on subpopulations or individual studies as opposed to merging independent studies into one data matrix. In addition, while meta-analysis is a viable approach for integrative MAGE analysis, it cannot transfer information among datasets in order to boost performance; moreover, robust differential expression does not necessarily translate into high predictive power.

Limitations and future work
Though our preliminary empirical results suggest that the TRL-FM framework is sound, we have identified potential limitations and several avenues for future work. First, in building FMs we relied only on the GO as the information source. Although the results are promising, relying on GO as the only source from which to extract domain knowledge might limit the knowledge base of the framework for generating prior rules for transfer. Future work could expand this knowledge base by exploring and incorporating other methods of eliciting domain knowledge for transfer learning. For instance, the GO-driven functional mapping module could morph into a lookup table, which would integrate information from other sources like Online Mendelian Inheritance in Man