# A comparison of graph- and kernel-based –omics data integration algorithms for classifying complex traits

Kang K. Yan^{1}, Hongyu Zhao^{2} and Herbert Pang^{1} (corresponding author)

**18**:539

https://doi.org/10.1186/s12859-017-1982-4

© The Author(s). 2017

**Received:** 5 June 2017

**Accepted:** 26 November 2017

**Published:** 6 December 2017

## Abstract

### Background

High-throughput sequencing data are widely collected and analyzed in the study of complex diseases in the quest to improve human health. Well-studied algorithms mostly deal with a single data source and cannot fully utilize the potential of multi-omics data sources. To provide a holistic understanding of human health and disease, it is necessary to integrate multiple data sources. Several algorithms have been proposed so far; however, a comprehensive comparison of data integration algorithms for the classification of binary traits is currently lacking.

### Results

In this paper, we focus on two common classes of integration algorithms: graph-based algorithms, which depict relationships with subjects denoted by nodes and relationships denoted by edges, and kernel-based algorithms, which generate a classifier in feature space. Our paper provides a comprehensive comparison of their performance in terms of various measurements of classification accuracy and computation time. Seven different integration algorithms, namely graph-based semi-supervised learning, graph sharpening integration, composite association network, Bayesian network, semi-definite programming-support vector machine (SDP-SVM), relevance vector machine (RVM) and Ada-boost relevance vector machine, are compared and evaluated with hypertension and two cancer data sets in our study.

In general, kernel-based algorithms create more complex models and require longer computation time, but they tend to perform better than graph-based algorithms. Graph-based algorithms have the advantage of being computationally faster.

### Conclusions

The empirical results demonstrate that composite association network, relevance vector machine, and Ada-boost RVM are the better performers. We provide recommendations on how to choose an appropriate algorithm for integrating data from multiple sources.

## Background

Recent advancements in –omics technologies have given us an unprecedented opportunity to understand the role of genomic, epigenetic, and transcriptomic features in human health and complex diseases. With falling sequencing costs and the availability of different sources of –omics data, a more thorough and comprehensive analysis of complex phenotypes can be achieved by integrating these diverse data sources, as a single data source is unlikely to provide a full and clear picture of human diseases. Data integration may allow us to identify patterns that become evident across different experiments, such as the identification of disease-gene associations by integrating different gene networks (i.e. functional interaction network, cancer module network and gene chemical network) using gene prioritization methods [1]. Thus, there is a great need to develop powerful data integration methodologies to fully harness the potential of these high-throughput data.

The ability to integrate multiple data sources can better inform researchers about the nature of the gene networks and biological interactions involved in disease. Each genomic data source used in an integrative method gives information on a different aspect of biology, such as mutation, regulation, and expression. Published results have shown that models built on integrated data sets can outperform those built on individual data sources. For example, Taskesen *et al.* [2] have shown that the prediction of known molecular subtypes of acute myeloid leukemia could be further improved by integrating gene expression and DNA-methylation profiles. Ma *et al.* [3] have proposed an effective method for the integrative analysis of DNA-methylation and gene expression in epigenetic modules. Graph and kernel methods are common ways of integrating multiple data sources for the classification of binary traits. The raw data are first mapped using graph or kernel methods to form relationships between samples before the data integration step. A graph is a natural way to depict relationships among samples, with subjects denoted by nodes and their relationships denoted by edges. Multiple graph- and kernel-based data integration algorithms have been proposed, making the selection of appropriate tools difficult. Recently, there has been a community effort to identify top data integration algorithms for predicting a continuous outcome such as drug sensitivity in human breast cancer cell lines [4]. However, to the best of our knowledge, there have been no reviews comparing the performance of these algorithms for binary outcomes, and there is a lack of empirical studies on how graph- and kernel-based data integration algorithms perform on real data. Therefore, our study aims to fill this gap by providing a comprehensive comparison of their performance in terms of various measures of classification accuracy and computation time.
We want to emphasize that the purpose of this paper is not to identify the best performing algorithm based on different combinations of data sources, but to compare the performance of data integration algorithms given a fixed number of data sources at hand.

## Methods

### Graph-based algorithms

Consider a network *G* with *n* indexed nodes (1, 2, ⋯, *n*), where the first *p* nodes carry known binary labels \( y_1, y_2, \cdots, y_p \) with \( y_i \in \{-1, 1\} \), and the remaining *n* − *p* unlabelled nodes are assigned 0 (unknown status). The main task of graph-based semi-supervised learning is to classify these unlabelled nodes using the network structure around them. The symmetric weight matrix *W* represents the connection strength between the nodes. The elements of *W* are non-negative (\( w_{ij} \ge 0 \)) and represent the degree of association; \( w_{ij} = 0 \) means that there is no edge between node *i* and node *j*. The algorithm generates an output function score \( \boldsymbol{f} = (f_1, f_2, \cdots, f_n)^T \) under two assumptions: (i) the score \( f_i \) should be close to the label \( y_i \) of a labelled node, and (ii) the score \( f_i \) should be close to the scores of its neighbouring nodes. Then **f** can be inferred from the following objective function:

\[ \min_{\boldsymbol{f}} \; \sum \limits_{i=1}^n {\left({f}_i-{y}_i\right)}^2 + c \sum \limits_{i,j=1}^n {w}_{ij}{\left({f}_i-{f}_j\right)}^2 \]

The first term, \( \sum \limits_{i=1}^n{\left({f}_i-{y}_i\right)}^2 \), corresponds to the loss between the label \( y_i \) and the function score \( f_i \); the second term, \( \sum \limits_{i,j=1}^n{w}_{ij}{\left({f}_i-{f}_j\right)}^2 \), corresponds to the smoothness assumption. Here, *c* is a trade-off parameter which controls the importance of the smoothness versus the loss. Up to a constant factor absorbed into *c*, this objective function can be rewritten as

\[ \min_{\boldsymbol{f}} \; {\left(\boldsymbol{f}-\boldsymbol{y}\right)}^T\left(\boldsymbol{f}-\boldsymbol{y}\right) + c\,{\boldsymbol{f}}^T L \boldsymbol{f} \]

where \( \boldsymbol{y} = (y_1, y_2, \cdots, y_n)^T \), and *L* is the Laplacian matrix of network *G*, \( L = D - W \), \( D = \operatorname{diag}(d_i) \), and \( d_i = \sum_j w_{ij} \). The optimal solution is \( \boldsymbol{f} = {\left(I + cL\right)}^{-1}\boldsymbol{y} \). The unlabelled nodes are then predicted by the median cut-off: a node is classified as \( y_i = 1 \) when its function score \( f_i \) is closer to the median function score of the nodes labelled as 1; otherwise, it is classified as \( y_i = -1 \).

Computation can be time-consuming and memory intensive when the dimension of *L* gets large. In practice, *L* can be very sparse, which makes it possible to apply graph-based semi-supervised learning to large-scale networks.
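The single-network solution and the median cut-off described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`solve`, `function_scores`, `median_cutoff`) and the toy chain graph are made up for this example, and a dense Gaussian elimination stands in for the sparse solver one would use on a large network.

```python
from statistics import median


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def function_scores(W, y, c=1.0):
    """Return f = (I + cL)^{-1} y with L = D - W; y holds 0 for unlabelled nodes."""
    n = len(y)
    A = [[(1.0 if i == j else 0.0) + c * ((sum(W[i]) if i == j else 0.0) - W[i][j])
          for j in range(n)] for i in range(n)]
    return solve(A, y)


def median_cutoff(f, y):
    """Keep known labels; give each unlabelled node the class whose median score is closer."""
    med_pos = median(fi for fi, yi in zip(f, y) if yi == 1)
    med_neg = median(fi for fi, yi in zip(f, y) if yi == -1)
    return [yi if yi != 0 else (1 if abs(fi - med_pos) < abs(fi - med_neg) else -1)
            for fi, yi in zip(f, y)]


# Toy chain 0-1-2-3-4-5 with a weak link between nodes 2 and 3 (made-up weights).
W = [[0, 1, 0, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [0, 1, 0, 0.1, 0, 0],
     [0, 0, 0.1, 0, 1, 0],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 0, 1, 0]]
y = [1, 0, 0, 0, 0, -1]          # only the two end nodes are labelled
f = function_scores(W, y, c=1.0)
labels = median_cutoff(f, y)
print(labels)
```

In this toy graph the unlabelled nodes on each side of the weak link inherit the label of the nearer end node, which is the behaviour the smoothness assumption is meant to produce.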

#### Graph-based semi-supervised learning

Given a group of nodes, different data sources may have different network structures and connection strengths among these nodes. Integrating different data sources by utilizing their network structures is an intuitive way of addressing the classification problem. Based on the concept of the single-network graph-based algorithm, an extension using a convex optimization model can be used to combine multiple data sources [5].

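Suppose there are *m* networks on the same set of nodes with Laplacian matrices \( L_1, L_2, \cdots, L_m \); then this integration problem can be formulated as below:

\[ \min_{\boldsymbol{f},\gamma} \; {\left(\boldsymbol{f}-\boldsymbol{y}\right)}^T\left(\boldsymbol{f}-\boldsymbol{y}\right) + c\gamma \quad \text{subject to} \quad {\boldsymbol{f}}^T{L}_k\boldsymbol{f} \le \gamma, \; k = 1, \cdots, m, \quad \gamma \ge 0 \]

where \( \gamma \) is the upper bound of the smoothness term \( {\boldsymbol{f}}^T{L}_k\boldsymbol{f} \) over all networks.

With Lagrange multipliers (\( \alpha_k, \eta \ge 0 \)), this objective function can be rewritten as follows:

\[ \min_{\boldsymbol{f},\gamma} \; \max_{\boldsymbol{\alpha},\eta} \; {\left(\boldsymbol{f}-\boldsymbol{y}\right)}^T\left(\boldsymbol{f}-\boldsymbol{y}\right) + c\gamma + \sum \limits_{k=1}^m{\alpha}_k\left({\boldsymbol{f}}^T{L}_k\boldsymbol{f} - \gamma \right) - \eta \gamma \]

Note that \( L_k \) is symmetric. This new objective function achieves its optimum when its derivative with respect to **f** equals zero. The function scores can then be solved by using \( \boldsymbol{f}={\left(I+\sum \limits_{k=1}^m{\alpha}_k{L}_k\right)}^{-1}\boldsymbol{y} \).

The function score **f** is thus formulated in terms of the Lagrange multipliers, and the sum of all Lagrange multipliers is constrained by the parameter *c*. To solve this problem, substitute **f** into the objective function above; the convex optimization problem is then equivalent to the minimization problem

\[ \min_{\boldsymbol{\alpha}} \; {\boldsymbol{y}}^T{\left(I+\sum \limits_{k=1}^m{\alpha}_k{L}_k\right)}^{-1}\boldsymbol{y} \quad \text{subject to} \quad \sum \limits_{k=1}^m{\alpha}_k \le c, \quad {\alpha}_k \ge 0 \]

where \( \alpha_k \) is treated as the weight of the network structure \( G_k \). The optimal function score can be obtained after solving this convex optimization problem. Network structures with zero weights are considered redundant and make no contribution to the optimal function score. The prediction process is the same as for a single network, using the median cut-off.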
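To make the role of the network weights concrete, the sketch below evaluates the quantity \( \boldsymbol{y}^T (I + \sum_k \alpha_k L_k)^{-1} \boldsymbol{y} \) for candidate weights with total at most *c* and keeps the smallest value. It is a crude grid search written for illustration only, not the convex solver used in the study; the function names and the two toy Laplacians are assumptions.

```python
from itertools import product


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def dual_objective(alphas, laplacians, y):
    """Return (y^T f, f) with f = (I + sum_k alpha_k L_k)^{-1} y."""
    n = len(y)
    A = [[(1.0 if i == j else 0.0)
          + sum(a * L[i][j] for a, L in zip(alphas, laplacians))
          for j in range(n)] for i in range(n)]
    f = solve(A, y)
    return sum(yi * fi for yi, fi in zip(y, f)), f


def grid_search_weights(laplacians, y, c=1.0, steps=10):
    """Try all weight vectors on a regular grid with alpha_k >= 0 and sum <= c."""
    grid = [c * i / steps for i in range(steps + 1)]
    best = None
    for alphas in product(grid, repeat=len(laplacians)):
        if sum(alphas) > c + 1e-12:
            continue
        value, f = dual_objective(alphas, laplacians, y)
        if best is None or value < best[0]:
            best = (value, alphas, f)
    return best


# Two toy networks on three nodes (made-up example).
L1 = [[1, -1, 0], [-1, 1, 0], [0, 0, 0]]    # single edge between nodes 0 and 1
L2 = [[1, 0, -1], [0, 0, 0], [-1, 0, 1]]    # single edge between nodes 0 and 2
y = [1, -1, 0]
value, alphas, f = grid_search_weights([L1, L2], y, c=1.0)
print(alphas, round(value, 4))
```

Here the second network receives zero weight, so it would be treated as redundant in the sense described above.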

#### Graph sharpening integration

In practice, the Laplacian matrix can occasionally be very dense and high-dimensional, which results in longer computation time when graph-based semi-supervised learning is performed. To reduce the computation time while maintaining or improving the performance of graph-based semi-supervised learning, Shin *et al.* [6] proposed the graph sharpening integration method, which reduces the complexity of the weight matrix in the graph-based learning algorithm. The relationship among labelled and unlabelled points described by the weight matrix *W* is symmetric, but full symmetry is not always desirable. That is, some edges may carry more useful information in one direction than in the opposite direction, and edges between oppositely labelled points may be unnecessary. Removing some edges in a graph structure yields a sparser and more parsimonious graph and reduces the computational burden. Suppose a network structure has weight matrix *W*, where \( w_{ij} \) represents the edge strength from node *j* to node *i*. First, edges from unlabelled nodes to labelled nodes are removed; then edges between oppositely labelled nodes are also removed. That is, \( w_{ij} = 0 \) if node *i* is labelled and node *j* is unlabelled, or if nodes *i* and *j* have opposite labels. The original dense *W* is thus made sparse by cutting these unhelpful edges. Even after the removal of these unnecessary edges, the graph sharpening algorithm still preserves sufficient information about the original network structure. First, no information is lost on the labelled nodes, whose influence on neighbouring nodes still exists. Second, the connection information of the unlabelled nodes is also preserved. The performance should therefore be reasonable when compared to graph-based semi-supervised learning, as illustrated by the results shown in Shin *et al.* [6].
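The two edge-removal rules can be written down directly. The following minimal sketch assumes the convention stated above (entry `W[i][j]` is the edge strength from node *j* to node *i*, and unlabelled nodes carry label 0); the function name `sharpen` is chosen for this illustration.

```python
def sharpen(W, y):
    """Return a sharpened copy of W.

    W[i][j] is the edge strength from node j to node i.
    y[i] is 1 or -1 for labelled nodes and 0 for unlabelled nodes.
    """
    n = len(y)
    S = [list(row) for row in W]
    for i in range(n):
        for j in range(n):
            if y[i] == 0:
                continue                      # edges into unlabelled nodes are kept
            if y[j] == 0:
                S[i][j] = 0.0                 # edge from unlabelled j into labelled i
            elif y[i] != y[j]:
                S[i][j] = 0.0                 # edge between oppositely labelled nodes
    return S


# Fully connected toy graph on three nodes: one positive, one negative, one unlabelled.
W = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]
y = [1, -1, 0]
S = sharpen(W, y)
for row in S:
    print(row)
```

Only the edges pointing into the unlabelled node survive in this example, which also shows why the sharpened matrix is no longer symmetric.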

Note that the weight matrix *W* in graph sharpening integration is no longer symmetric, so the Laplacian matrix *L* becomes asymmetric. Considering the objective function in the graph-based integration algorithm and setting its derivative with respect to **f** to zero, the optimal solution can be written as

\[ \boldsymbol{f}={\left(I+\frac{1}{2}\sum \limits_{k=1}^m{\alpha}_k\left({L}_k+{L}_k^T\right)\right)}^{-1}\boldsymbol{y} \]

Similar to graph-based semi-supervised learning, **α**, the weights of the different network structures, can be obtained from the convex optimization problem by substituting **f** into the objective function. The prediction is once again based on the median cut-off.

Both graph-based semi-supervised learning and graph sharpening integration involve the tuning parameter *c*, which is a trade-off between loss of information and smoothness. This value is determined by repeated *k*-fold cross-validation on the training set through a search over a grid of candidate values.

#### Composite association network

As described above, the weights assigned to the different networks in graph-based semi-supervised learning and graph sharpening integration are determined by solving a convex optimization problem, which is computationally costly unless *L* is very sparse. The composite association network approach [7] addresses this limitation by using linear regression to obtain the weights of the different data sources.

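Suppose we have *m* associated networks with symmetric weight matrices \( W_i \), whose elements indicate the edge strengths and are all non-negative. Let \( \boldsymbol{y} = (y_1, y_2, \cdots, y_n)^T \) be the label vector of the nodes in the networks, where each element \( y_i \in \{-1, 1\} \) is a binary variable. The target network *T* is defined as the functional relationships of **y**. Each entry \( T_{ij} \) takes one of three values, depending on whether nodes *i* and *j* are both positive, both negative, or oppositely labelled, and expressed in terms of \( n_{+} \) and \( n_{-} \), the total numbers of positives and negatives in the label vector. The target is to integrate the *m* associated networks with weights \( \boldsymbol{\alpha} = (\alpha_1, \alpha_2, \cdots, \alpha_m)^T \), and the composite weight matrix is \( \overline{W}=\sum \limits_{i=1}^m{\alpha}_i{W}_i \). Intuitively, in a target network *T*, pairs of nodes with the same label (both positive or both negative) have high similarity, whereas pairs with one positive node and one negative node have low similarity. The values of *T* influence the weights of the composite association networks. The objective function minimizes the least squares error between the target network *T* and the composite weight matrix \( \overline{W} \):

\[ \min_{\boldsymbol{\alpha}} \; \operatorname{trace}\left({\left(T-\overline{W}\right)}^T\left(T-\overline{W}\right)\right) \]

Since \( \operatorname{trace}(A^T B) = \operatorname{vec}(A)^T \operatorname{vec}(B) \), the objective function can be rewritten as

\[ \min_{\boldsymbol{\alpha}} \; {\left(\operatorname{vec}(T)-\Omega \boldsymbol{\alpha}\right)}^T\left(\operatorname{vec}(T)-\Omega \boldsymbol{\alpha}\right) \]

where \( \Omega = \left[\operatorname{vec}(W_1), \operatorname{vec}(W_2), \cdots, \operatorname{vec}(W_m)\right] \).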

As mentioned above, the target network *T* takes only three values, so *vec*(*T*) can be treated as pair-specific covariates. In our case, we specified three categories: positive-positive, negative-negative and positive-negative [7]. Unlike in graph-based semi-supervised learning, the weights obtained with the composite association network may be negative. To avoid this situation, \( \alpha_i \) is set to zero when it is negative. If \( \alpha_i \le 0 \) for all *i*, the average weights \( \alpha_i = 1/m \) replace the original weights of the association networks. In practice, a bias weight \( \alpha_0 \) is added to **α** and the first column of Ω is filled with ones; \( \alpha_0 \) is discarded when integrating the weight matrices of the association networks.

Once we obtain the composite weight matrix \( \overline{W} \), we employ graph-based semi-supervised learning for a single network. The function scores can be solved by the formula \( \boldsymbol{f} = {\left(I + cL\right)}^{-1}\boldsymbol{y} \), where *L* is the Laplacian matrix related to the weight matrix \( \overline{W} \). *c* is set to 1 for the composite association network, as in the original paper by Mostafavi *et al.* [8].
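The regression step can be sketched as an ordinary least-squares fit of *vec*(*T*) on a bias column plus the vectorised networks, followed by the two clean-up rules for non-positive weights. The code below is a sketch under assumptions: the target matrix is passed in by the caller, and the toy target uses placeholder values (+1 for same-label pairs, −1 for opposite-label pairs) rather than the values used in the paper.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def vec(M):
    """Stack the entries of a matrix into one list."""
    return [v for row in M for v in row]


def composite_weights(networks, T):
    """Least-squares weights for the networks, with a bias term that is discarded."""
    m = len(networks)
    t = vec(T)
    columns = [[1.0] * len(t)] + [vec(W) for W in networks]   # bias column first
    gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    rhs = [sum(a * b for a, b in zip(ci, t)) for ci in columns]
    alpha = solve(gram, rhs)[1:]                 # drop the bias weight alpha_0
    if all(a <= 0 for a in alpha):
        return [1.0 / m] * m                     # fall back to average weights
    return [max(a, 0.0) for a in alpha]          # negative weights are set to zero


def composite_matrix(networks, weights):
    n = len(networks[0])
    return [[sum(w * W[i][j] for w, W in zip(weights, networks)) for j in range(n)]
            for i in range(n)]


# Toy example: labels (1, 1, -1); placeholder target values, not the paper's.
T = [[1, 1, -1], [1, 1, -1], [-1, -1, 1]]
W1 = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]          # agrees with the labels
W2 = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]          # links a positive and a negative node
weights = composite_weights([W1, W2], T)
W_bar = composite_matrix([W1, W2], weights)
print([round(w, 6) for w in weights])
```

Solving the normal equations with a hand-written elimination is adequate for a handful of networks; a numerical library would be the natural choice for anything larger.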

#### Bayesian network

A Bayesian network [9] is a probabilistic directed acyclic graphical model that is composed of a set of random variables and their conditional dependencies. Nodes in a Bayesian network represent different variables, and their conditional dependencies are specified via directed edges. Each node is associated with a probability function that takes a particular set of values of its parent variables as input and gives the probability of the variable represented by this node as output. The main idea of this approach is Bayesian inference, that is, the posterior probability is proportional to the product of the prior probability and the likelihood. We now describe the use of the Bayesian network for data integration.

Suppose we have *n* samples with *m* variables \( v_1, v_2, \cdots, v_m \), which are classified into two groups and labelled as *y*, where \( y \in \{-1, 1\} \); the first *k* variables \( v_1, v_2, \cdots, v_k \) are conditionally dependent and the remaining variables are conditionally independent given *y*. With the given samples, the prior probability \( p(y) \) and the likelihood \( p(v_1, v_2, \cdots, v_m \mid y) \) can be obtained directly. Then the posterior probability of *y*, denoted as \( p(y \mid v_1, v_2, \cdots, v_m) \), can be expressed as

\[ p\left(y \mid {v}_1,{v}_2,\cdots, {v}_m\right)=\frac{p(y)\,p\left({v}_1,{v}_2,\cdots, {v}_m \mid y\right)}{p\left({v}_1,{v}_2,\cdots, {v}_m\right)} \]

Since computing \( p(v_1, v_2, \cdots, v_m) \) can be cumbersome, an intuitive way is to use the posterior odds ratio rather than the posterior probability. The posterior odds ratio can be computed from the likelihood odds ratio and the prior odds ratio. That is,

\[ {Odd}_{post}=\frac{p\left(y=1 \mid {v}_1,\cdots, {v}_m\right)}{p\left(y=-1 \mid {v}_1,\cdots, {v}_m\right)}=\frac{p\left({v}_1,\cdots, {v}_m \mid y=1\right)}{p\left({v}_1,\cdots, {v}_m \mid y=-1\right)}\times {Odd}_{prior} \]

where \( {Odd}_{prior} = p(y=1)/p(y=-1) \) is the prior odds ratio, which reflects the proportion of the two groups in the sample set. Further, considering the conditional dependencies of these variables in the structure of the Bayesian network, the likelihood function can be rewritten as

\[ p\left({v}_1,{v}_2,\cdots, {v}_m \mid y\right)=p\left({v}_1,{v}_2,\cdots, {v}_k \mid y\right)\prod \limits_{i=k+1}^m p\left({v}_i \mid y\right) \]

Samples with \( {Odd}_{post} > 1 \) are classified as 1, otherwise as −1. The larger the posterior odds ratio is, the more likely *y* is to be classified as 1.
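For the simple Bayesian network, in which all variables are conditionally independent given *y*, the posterior odds ratio reduces to the prior odds ratio multiplied by one likelihood ratio per variable. The sketch below estimates these quantities from binned scores. The add-one pseudocount is an assumption introduced here only to keep the toy example free of zero counts; the function names and the data are illustrative.

```python
from collections import Counter


def fit_simple_bn(X, y, n_bins, pseudocount=1.0):
    """Estimate the prior odds and per-variable bin probabilities for each class.

    X is a list of tuples of bin indices, y a list of labels in {1, -1}.
    """
    n_class = {c: sum(1 for yi in y if yi == c) for c in (1, -1)}
    cond = {}
    for c in (1, -1):
        rows = [x for x, yi in zip(X, y) if yi == c]
        for v in range(len(X[0])):
            counts = Counter(row[v] for row in rows)
            for b in range(n_bins):
                cond[(c, v, b)] = ((counts[b] + pseudocount)
                                   / (n_class[c] + pseudocount * n_bins))
    prior_odds = n_class[1] / n_class[-1]
    return prior_odds, cond


def posterior_odds(x, prior_odds, cond):
    """Prior odds times the product of per-variable likelihood ratios."""
    odds = prior_odds
    for v, b in enumerate(x):
        odds *= cond[(1, v, b)] / cond[(-1, v, b)]
    return odds


def classify(odds):
    return 1 if odds > 1 else -1


# Six toy samples with two binned variables each (made-up data).
X = [(0, 0), (0, 1), (0, 0), (1, 1), (1, 1), (1, 0)]
y = [1, 1, 1, -1, -1, -1]
prior_odds, cond = fit_simple_bn(X, y, n_bins=2)
odds_a = posterior_odds((0, 0), prior_odds, cond)
odds_b = posterior_odds((1, 1), prior_odds, cond)
print(odds_a, classify(odds_a), odds_b, classify(odds_b))
```

Without the pseudocount, any bin that is empty for one class would give a zero denominator, which is one way an infinite odds ratio can arise with small samples.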

In our study, important SNPs/genes are first filtered from the different data sources based on the process described by Klein *et al.* [16]. Briefly, each SNP/gene is tested for association with the dichotomized label, and the SNPs/genes whose *P*-values pass the Bonferroni-corrected threshold are included. Scores are assigned to patients based on these filtered SNPs/genes. We discretize the scores into several bins based on their respective quartiles. An edge is added between two nodes when their conditional correlation coefficient exceeds the threshold of 0.3. Both simple Bayesian networks and structured Bayesian networks are considered in our study. Illustrations of the four graph-based learning algorithms can be found in Additional file 1: Section A.

### Kernel-based algorithms

#### Semi-definite programming SVM

The support vector machine (SVM) is a well-known kernel-based algorithm that creates a hyperplane classifier by solving a quadratic program based on the kernel function and the labels. The use of kernel functions provides a powerful approach to detecting nonlinear relationships in the feature space, i.e. a high-dimensional representation of the input variables. Its main goal is to search for a linear classifier in the feature space that has the maximum margin between the two groups. Semi-definite programming SVM [10, 11], which combines the semi-definite programming framework with SVM, extends the quadratic program to multiple kernels. It is readily applicable to multiple kernel learning and makes it possible to integrate different data sources with different kernel functions.

Suppose we have a set of kernel matrices \( \kappa = \{K_1, K_2, \cdots, K_m\} \) with embedding function Φ(*x*), and let \( K=\sum \limits_{i=1}^m{\mu}_i{K}_i \) be a linear combination of these kernels. The combined kernel *K* is positive semidefinite if \( \mu_i \ge 0 \) for \( i \in \{1, 2, \cdots, m\} \). Thus, the \( \mu_i \) can be considered as the linear weights of the kernels \( K_i \). Suppose we are given a set of training data \( x = (x_1, x_2, \cdots, x_n) \) with corresponding labels \( \boldsymbol{y} = (y_1, y_2, \cdots, y_n)^T \), where \( y_i \in \{-1, 1\} \). The objective hyperplane is \( w^T \Phi(x) + b = 0 \), where *w* is a linear combination of the kernel functions corresponding to the \( x_i \). The 1-norm soft margin SVM optimization problem can be described as follows.

\[ \min_{w,b,\xi} \; \frac{1}{2}{w}^Tw + C\sum \limits_{i=1}^n{\xi}_i \quad \text{subject to} \quad {y}_i\left({w}^T\Phi \left({x}_i\right)+b\right) \ge 1-{\xi}_i, \quad {\xi}_i \ge 0, \quad i = 1, \cdots, n \]

Here *C* is a penalty parameter that trades off margin against loss. By considering the corresponding dual problem, Schölkopf and Smola [17] proved that the weight vector could be represented as \( w={\sum}_{i=1}^n{\alpha}_i\Phi \left({x}_i\right) \), where the support vector **α** is obtained by solving the dual problem with the combined kernel *K*. In that dual problem, *c* is a regularization parameter that controls the linear weights of the kernels and **e** is a vector of ones. This convex problem can be reformulated as a quadratically constrained quadratic program (QCQP) after considering its Lagrange dual problem.

This QCQP is a special form of semi-definite programming that can be solved efficiently with interior point methods [18]. The computational complexity of solving this SDP can be \( O(mn^3) \) in the worst case. Solving this problem yields the optimal solution for **α** and the optimal values of its dual variables \( \mu_i \). Finally, the hyperplane classifier \( f = w^T x + b \) is calculated by evaluating \( w^T x \) through the kernel expansion \( \sum \limits_{i=1}^n{\alpha}_iK\left({x}_i,x\right) \), where \( K=\sum \limits_{i=1}^m{\mu}_i{K}_i \), and \( b=-\frac{\max_{i,{y}_i=-1}{w}^T{x}_i+{\min}_{i,{y}_i=1}{w}^T{x}_i}{2} \). An unclassified *x* is classified as 1 when *f* is positive, and as −1 otherwise.

In our study, *c* is set to the training set sample size, which ensures that the sum of the kernel weights equals one, and *C* is determined by grid search.
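The kernel combination itself is simple to illustrate. The sketch below builds one radial basis function kernel per data source and forms their weighted sum. The parameterisation \( \exp(-\sigma \lVert x - z \rVert^2) \), the function names and the toy data are assumptions. Because every RBF kernel matrix has ones on its diagonal, the trace of the combined kernel equals *n* times the sum of the weights, which is one way to read the remark that setting *c* to the sample size makes the weights sum to one under a trace constraint of the kind used in [10, 11].

```python
import math


def rbf_kernel(X, sigma):
    """K[i][j] = exp(-sigma * ||x_i - x_j||^2); this parameterisation is an assumption."""
    return [[math.exp(-sigma * sum((a - b) ** 2 for a, b in zip(xi, xj)))
             for xj in X] for xi in X]


def combine_kernels(kernels, mu):
    """Weighted sum of kernel matrices with non-negative weights mu."""
    if any(m < 0 for m in mu):
        raise ValueError("kernel weights must be non-negative")
    n = len(kernels[0])
    return [[sum(m * K[i][j] for m, K in zip(mu, kernels)) for j in range(n)]
            for i in range(n)]


def trace(K):
    return sum(K[i][i] for i in range(len(K)))


# Two toy data sources measured on the same three subjects (made-up values).
expression = [[0.1, 0.2], [0.4, 0.1], [0.9, 0.8]]
clinical = [[1.0], [0.0], [1.0]]
K1 = rbf_kernel(expression, sigma=1.0)
K2 = rbf_kernel(clinical, sigma=0.5)
mu = [0.7, 0.3]                      # hypothetical weights; the SDP would estimate these
K = combine_kernels([K1, K2], mu)
print(round(trace(K), 6))
```

In the actual algorithm the weights are an output of the optimization rather than an input; they are fixed here only to show the bookkeeping.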

#### Relevance vector machine

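Suppose we have training data \( \{(x_n, y_n)\}_{n=1}^N \), where \( x_n \in R^d \) and \( y_n \in \{-1, 1\} \). The RVM classification model can be written as a linear combination of kernel functions *k*:

\[ f(x)=\sum \limits_{n=1}^N{w}_n\,k\left(x,{x}_n\right)=W{K}^T \]

where \( W = [w_1, w_2, \cdots, w_N] \) and \( K = [k(x, x_1), k(x, x_2), \cdots, k(x, x_N)] \). After training, only *m* samples are reserved as relevance points. The probability is calculated by the following sigmoid function:

\[ p\left(y=1 \mid x\right)=\frac{1}{1+\exp \left(-f(x)\right)} \]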

The performance of RVM can be very similar to SVM, but RVM is more competitive than SVM in the following aspects. (i) The result of RVM is sparser than SVM and the kernel computation time can be largely reduced; (ii) RVM can provide probabilistic prediction for classification problems by returning the class probabilities; (iii) RVM does not require the specification of a loss parameter; and (iv) Kernel function in RVM is more flexible without the Mercer’s condition [19] restriction.
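Once the relevance points and their weights are available, producing a class probability only requires the kernel expansion and a sigmoid. The sketch below shows that prediction step, not the training procedure; the relevance points, weights, kernel parameterisation and function names are all hypothetical.

```python
import math


def rbf(x, z, sigma=1.0):
    """Radial basis function kernel; the parameterisation is an assumption."""
    return math.exp(-sigma * sum((a - b) ** 2 for a, b in zip(x, z)))


def rvm_probability(x, relevance_points, weights, sigma=1.0):
    """p(y = 1 | x) from the kernel expansion over the relevance points."""
    f = sum(w * rbf(x, r, sigma) for w, r in zip(weights, relevance_points))
    return 1.0 / (1.0 + math.exp(-f))


def classify(p, cutoff=0.5):
    return 1 if p > cutoff else -1


# Hypothetical model with two relevance points in one dimension.
relevance_points = [(0.0,), (2.0,)]
weights = [-2.0, 2.0]
for x in [(0.0,), (1.0,), (2.0,)]:
    p = rvm_probability(x, relevance_points, weights)
    print(x, round(p, 3), classify(p))
```

Only the retained relevance points enter the sum, which is why a sparse model keeps the kernel computation cheap at prediction time.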

Suppose we have *k* different associated data sources with a corresponding outcome *Y*, where \( Y = (y_1, y_2, \cdots, y_n)^T \) and \( y_i \in \{-1, 1\} \). For each data source, an individual RVM model is generated with the corresponding kernel matrix, such as the radial basis function kernel. Denote \( P_1, P_2, \cdots, P_k \) as the *k* sets of probability predictions from the individual RVM models, where each \( P_i \) is an *n* × 1 vector. The final probability vector, with elements \( p_i \), is obtained by combining \( P_1, P_2, \cdots, P_k \).

Note that \( p_i \) is the probability of \( y_i = 1 \). The cut-off point is 0.5, which means that a sample is classified as 1 when \( p_i > 0.5 \). The greater \( p_i \) is, the higher the chance that \( y_i \) is classified as 1.

#### Ada-boost RVM

Suppose we have training data \( \{(x_n, y_n)\}_{n=1}^N \), where \( x_n \in R^d \), \( y_n \in \{-1, 1\} \), and *N* is the training set sample size. Let \( w_i = 1/N \) denote the initial weights of the training samples. In the *t*-th iteration, first train an RVM learner, denoted as \( RVM_t \), on *n* random samples selected from the training set without replacement, then calculate its weighted misclassification error on the training samples by the formula \( {\varepsilon}_t=\sum \limits_{i=1}^N{w}_i\,\mathbb{1}\left[{RVM}_t\left({x}_i\right)\ne {y}_i\right] \). If \( \varepsilon_t \ge 0.5 \), jump to the next iteration; otherwise, set the weight of this learner \( RVM_t \) equal to \( {\alpha}_t=\frac{1}{2}\ln \left(\frac{1-{\varepsilon}_t}{{\varepsilon}_t}\right) \) and update the final model as \( RVM_{final} = RVM_{final} + \alpha_t RVM_t \). The weights of the samples are then updated as

\[ {w}_i \leftarrow {w}_i\exp \left(-{\alpha}_t\,{y}_i\,{RVM}_t\left({x}_i\right)\right), \quad i = 1, \cdots, N \]

The new weights \( w_i \) should be normalized such that \( \sum \limits_{i=1}^N{w}_i=1 \) before moving to the next iteration. After *T* iterations, the final model can be represented as \( RVM_{final} = \sum_j \alpha_j RVM_j \), where the sum runs over the learners with \( \varepsilon_j < 0.5 \).

The pseudo code for Ada-boost RVM can be found in Additional file 1: Section B.
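The boosting loop can be sketched independently of the base learner. In the illustration below a one-dimensional threshold classifier stands in for the RVM learner, the subsample is drawn uniformly without replacement, the sample weights follow the standard discrete AdaBoost exponential update, and a small floor on the error avoids a division by zero when a learner is perfect. All of these choices, together with the function names and the toy data, are assumptions made for this example; the authors' own pseudo code is in their Additional file 1.

```python
import math
import random


def train_stump(xs, ys):
    """Fit a 1-D threshold classifier (a stand-in for an RVM learner)."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        theta = (lo + hi) / 2
        for sign in (1, -1):
            errors = sum(1 for x, y in zip(xs, ys)
                         if (sign if x > theta else -sign) != y)
            if best is None or errors < best[0]:
                best = (errors, theta, sign)
    _, theta, sign = best
    return lambda x: sign if x > theta else -sign


def adaboost(xs, ys, n_sub, n_iter, rng):
    """Return (predict, learners, weights) after n_iter boosting rounds."""
    N = len(xs)
    w = [1.0 / N] * N
    learners = []                                   # pairs (alpha_t, h_t)
    for _ in range(n_iter):
        idx = rng.sample(range(N), n_sub)           # subsample without replacement
        h = train_stump([xs[i] for i in idx], [ys[i] for i in idx])
        eps = sum(wi for wi, x, y in zip(w, xs, ys) if h(x) != y)
        if eps >= 0.5:
            continue                                # skip learners no better than chance
        eps = max(eps, 1e-10)                       # floor for perfect learners (assumption)
        alpha = 0.5 * math.log((1 - eps) / eps)
        learners.append((alpha, h))
        w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]                # renormalise to sum to one

    def predict(x):
        return 1 if sum(a * h(x) for a, h in learners) > 0 else -1

    return predict, learners, w


# Toy 1-D data with a class boundary between 0.4 and 0.6 (made-up values).
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [-1, -1, -1, -1, 1, 1, 1, 1]
predict, learners, w = adaboost(xs, ys, n_sub=5, n_iter=10, rng=random.Random(0))
print(len(learners), [predict(x) for x in xs])
```

Swapping `train_stump` for a routine that fits an RVM on the subsample would leave the rest of the loop unchanged.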

### Performance measure

Three performance measures are used to evaluate the algorithms: the accuracy rate, the *F*1 score (also called the F-measure) and the Area Under the receiver operating characteristic (ROC) Curve (AUC). The accuracy rate measures the percentage of entities that are correctly classified. The *F*1 score combines the precision and recall rates in classification problems, and can be calculated as the harmonic mean of the precision and recall rates. Given a binary classification problem with *P* positive and *N* negative entities, the predicted and true labels form a 2 × 2 confusion matrix. Four values, true positive *tp*, false positive *fp*, false negative *fn* and true negative *tn*, can be calculated from this table. Sensitivity and specificity are defined as

\[ \text{sensitivity}=\frac{tp}{tp+ fn}, \qquad \text{specificity}=\frac{tn}{tn+ fp} \]

and the *F*1 score can be calculated as

\[ F1=\frac{2\, tp}{2\, tp+ fp+ fn} \]

ROC curve captures the sensitivity as a function of (1-specificity). It illustrates the overall performance of a binary classifier by varying the discrimination threshold. The AUC has a value between 0 and 1. A value of 1 implies that the algorithm has a perfect classification while a value of 0.5 suggests that the algorithm is no better than a random guess.
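These measures are straightforward to compute from predicted labels and scores. The sketch below uses the confusion-matrix definitions given above and computes the AUC as the fraction of positive-negative pairs in which the positive sample receives the higher score (ties count one half), which equals the area under the ROC curve. The function names and the toy predictions are illustrative.

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for labels coded as 1 and -1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    return tp, fp, fn, tn


def summary(y_true, y_pred):
    tp, fp, fn, tn = confusion(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def auc(y_true, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == -1]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Made-up predictions for six subjects.
y_true = [1, 1, -1, -1, 1, -1]
y_pred = [1, -1, -1, 1, 1, -1]
scores = [0.9, 0.4, 0.3, 0.6, 0.8, 0.2]
stats = summary(y_true, y_pred)
area = auc(y_true, scores)
print(stats, round(area, 4))
```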

These three performance measures are determined over 200 runs. 95% confidence intervals, calculated by percentile bootstrap, are used to assess the variability of the algorithms. Computation time is also considered as an evaluation factor in our study. It is clocked on a desktop PC with an Intel Core i7 3.60 GHz processor and 16 GB of memory, running R version 3.2.3. The computation time is based on the integration of three different data sources and covers only the model training session. The time spent calculating the weight and kernel matrices and filtering SNPs/genes in the Bayesian network model is excluded.
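A percentile bootstrap interval for the mean of a performance measure over repeated runs can be sketched as follows. The number of bootstrap resamples, the nearest-rank percentile rule, the seed and the function name are assumptions made for this example; the study does not state these details.

```python
import random


def percentile_bootstrap_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    tail = (1 - level) / 2
    lower = means[int(tail * n_boot)]
    upper = means[min(n_boot - 1, int((1 - tail) * n_boot))]
    return lower, upper


# Made-up accuracy rates from 200 runs.
rng = random.Random(1)
accuracies = [0.6 + 0.1 * rng.random() for _ in range(200)]
low, high = percentile_bootstrap_ci(accuracies)
print(round(low, 4), round(high, 4))
```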

## Data sets

Data from hypertension and cancer are used to evaluate and compare the seven data integration algorithms. Hypertension is known as the leading cause of cardiovascular mortality in the world [20]. Moreover, cancer and heart disease are the leading causes of death. Our understanding of these complex diseases from different angles of biology can be improved with the availability of multi-omics data integration algorithms. The Genetic Analysis Workshop (GAW) 19 data set was evaluated in our study, which includes data on genotypes, gene expression, and clinical data (including blood pressure and covariates such as smoking status and age). For this family data, there are 312 patients with normal blood pressure, and 305 pre-hypertension and hypertension subjects from 17 families.

Data sets used for evaluating the data integration algorithms

Data Set | Sample Size | Data Source | Platform | Number of Features |
---|---|---|---|---|
GAW 19 | 617 | Genotypes | Illumina Infinium Beadchips | 440,762 |
GAW 19 | 617 | Gene Expression | Illumina Sentrix Human-6 Expression BeadChips | 20,634 |
GAW 19 | 617 | Clinical Covariates | Clinical Data | 2 |
Ovarian | 135 | Gene Expression | Agilent G4502A | 17,814 |
Ovarian | 135 | miRNA Expression | Agilent Human miRNA 8x15K | 799 |
Ovarian | 135 | Protein Expression | Reverse phase protein array | 176 |
Ovarian | 135 | Methylation | HumanMethylation 27 | 24,981 |
Breast | 453 | RNA SeqV2 | Illumina HiSeq | 20,531 |
Breast | 453 | miRNA Expression | Agilent Human miRNA 8x15K | 1,046 |
Breast | 453 | Protein Expression | Reverse phase protein array | 166 |
Breast | 453 | Methylation | HumanMethylation 450 | 396,065 |

The impact of imbalanced data sets on the performance of the seven algorithms has also been investigated by real data simulation. In this simulation, we consider three additional situations, including a more imbalanced and a more balanced breast cancer data set obtained by sampling without replacement, with ratios of positive to negative ER status of 5:1 and 5:2, respectively. The breast cancer data set is chosen because it is the most imbalanced and has a relatively large sample size.

## Results

In this section, we present the empirical assessment of the seven data integration algorithms. The results compared in the following section are based on (1) the Pearson correlation matrix; (2) the simple Bayesian network; and (3) the radial basis function kernel with a scaling parameter sigma determined by grid search using 5-fold cross validation in the training set. The reasons are as follows. (1) Spearman's rank correlation matrix and the Pearson correlation matrix are both used as weight matrices in graph-based semi-supervised learning, graph sharpening integration, and composite association network; the negative elements of the two correlation matrices are set to zero, as a weight matrix should be non-negative. The performance with Spearman's rank correlation matrix is only slightly better than with the Pearson correlation matrix in most cases for the graph-based algorithms, while its computational complexity is \( O(n^2 \log n) \), which may become prohibitive for larger sample sizes. (2) The simple Bayesian network and the structured Bayesian network are compared in our study. Their performance is similar, but the structured Bayesian network frequently leads to infinite odds ratios due to the small sample size. (3) Linear and radial basis function kernels are tested in the kernel-based algorithms. The radial basis function kernel performs better than the linear kernel in the three data sets investigated.
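The construction of a non-negative weight matrix from Pearson correlations between subjects can be sketched as below. The function names and the toy feature vectors are illustrative, and the sketch assumes that no subject has constant features, since the correlation is undefined in that case.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equally long feature vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_weight_matrix(samples):
    """W[i][j] = max(0, r_ij): negative correlations are set to zero."""
    n = len(samples)
    return [[max(0.0, pearson(samples[i], samples[j])) for j in range(n)]
            for i in range(n)]


# Three toy subjects with three features each (made-up values).
samples = [[1.0, 2.0, 3.0],
           [2.0, 4.0, 6.0],
           [3.0, 2.0, 1.0]]
W = correlation_weight_matrix(samples)
for row in W:
    print([round(v, 3) for v in row])
```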

### Performance comparisons

#### Graph-based algorithms

First, we present the results of the four graph-based algorithms. As described in the Methods section, the difference between graph-based semi-supervised learning and graph sharpening integration is the sparseness of the weight matrix. Compared to graph-based semi-supervised learning, graph sharpening integration still performs reasonably well with the sparser weight matrices obtained by removing undesirable edges from the network structures. However, the performance of graph sharpening integration may not be as stable, as illustrated by the three data sets. Graph sharpening performs better than graph-based semi-supervised learning with the GAW 19 data set (62.1% mean accuracy rate against 60.0%), while it performs worse with the ovarian and breast data sets (63.3% mean accuracy rate compared to 66.7% in ovarian, and 77.5% compared to 84.1% in breast). From Fig. 2, we can observe that the confidence interval of the simple Bayesian network is slightly wider than those of the other graph-based algorithms, even though the mean accuracy rates of the various graph-based algorithms are similar for the GAW 19 data set. This indicates that the simple Bayesian network has a larger prediction variation than the other graph-based algorithms. The composite association network usually performs better than all of the other graph-based algorithms in terms of accuracy rate, F1 score and AUC, with the advantage that it only requires solving one linear regression problem. It is also quite stable when the variability of these graph-based algorithms is considered.

#### Kernel-based algorithms

The performance of kernel-based algorithms is usually better than that of graph-based algorithms, but the kernel-based models are more complex and require longer computation time due to the need to generate the hyperplane classifier. In semi-definite programming SVM, some combinations of the two tuning parameters *c* and *C* may lead to long computation times in solving the QCQP. In our study, we found this to be particularly true when *C* is less than one. RVM and Ada-boost RVM are probabilistic models, which can return probability predictions but require longer computation time than semi-definite programming SVM. We observed that Ada-boost RVM can achieve good performance with our data sets when the resampling size is set to 40% or 60% of the training sample size and the maximum iteration number is set to 5 or 10.

Semi-definite programming SVM shows larger variation and lower performance than RVM and Ada-boost RVM. The relative performance of RVM and Ada-boost RVM varies across the three data sets, which makes it difficult to rank these two algorithms, but the difference in mean accuracy between them is very small.

### Imbalanced data simulation

Additional file 1: Section C presents the mean accuracy, mean F1 score and mean AUC of the different integration algorithms in three simulated imbalanced data sets. Among the four graph-based algorithms, the performance of the composite association network and the Bayesian network is less influenced by imbalanced data. The imbalanced data simulation also suggests that the composite association network usually outperforms the Bayesian network. The performance of RVM and Ada-boost RVM is better and more stable in the imbalanced data simulations than that of the other graph-based or kernel-based algorithms. In contrast, the performance of SDP-SVM is affected by the imbalanced data sets.

### Computation time

*O*(*n*^{3}). Additional iterations of the boosting procedure in Ada-boost RVM require more time than RVM when the sample size is small: 23.19 s for Ada-boost RVM against 10.47 s for RVM with 100 training samples. However, the computation time of Ada-boost RVM grows more slowly than that of RVM as the training sample size increases. It is about 1 minute less than that of RVM when the sample size reaches 400 (306.17 s against 368.46 s).

Average computation time (in seconds) of different integration algorithms with different training sizes

| Integration Algorithms | Training Size 100 | Training Size 400 |
|---|---|---|
| Graph-based semi-supervised learning | 0.127 | 4.148 |
| Graph sharpening integration | 0.052 | 1.943 |
| Composite association network | 0.007 | 0.052 |
| Bayesian network | 0.002 | 0.004 |
| Semi-definite programming – SVM | 12.553 | 28.186 |
| Relevance vector machine | 10.471 | 368.455 |
| Ada-boost relevance vector machine | 23.190 | 306.172 |
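The timings in the table above can be turned into a rough empirical growth rate. The sketch below assumes that computation time follows a power law, time ∝ *n*^{k}, and estimates *k* from the two training sizes. The function name and the power-law assumption are introduced here for illustration; with only two measurements per algorithm, and with the smallest timings rounded to a few milliseconds, the estimates should be read as indicative only.

```python
import math

# Average computation times in seconds, copied from the table above:
# (training size 100, training size 400).
TIMES = {
    "Graph-based semi-supervised learning": (0.127, 4.148),
    "Graph sharpening integration": (0.052, 1.943),
    "Composite association network": (0.007, 0.052),
    "Bayesian network": (0.002, 0.004),
    "Semi-definite programming SVM": (12.553, 28.186),
    "Relevance vector machine": (10.471, 368.455),
    "Ada-boost relevance vector machine": (23.190, 306.172),
}


def scaling_exponent(t_small, t_large, n_small=100, n_large=400):
    """Estimate k in time ~ n**k from two (size, time) measurements."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)


exponents = {name: scaling_exponent(*t) for name, t in TIMES.items()}
for name, k in exponents.items():
    print(f"{name}: k = {k:.2f}")

# Absolute gap between RVM and Ada-boost RVM at 400 training samples.
gap_400 = TIMES["Relevance vector machine"][1] - TIMES["Ada-boost relevance vector machine"][1]
print(f"RVM minus Ada-boost RVM at n = 400: {gap_400:.1f} s")
```

Under this assumption the estimated exponent for RVM lies between 2 and 3, which is broadly in line with cubic worst-case growth, while the exponent for Ada-boost RVM is lower, matching the observation that its cost grows more slowly with training size.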

## Discussion

In this paper, we conducted a comprehensive comparison of seven graph- and kernel-based data integration algorithms for subject classification using the GAW 19, ovarian cancer and breast cancer data sets. The results show that the kernel-based algorithms usually perform better than the graph-based algorithms but require longer computation time, whereas the graph-based algorithms are faster but perform less well overall.

Graph-based semi-supervised learning and graph sharpening integration involve some tuning parameters, which can be selected via *k*-fold cross validation in the training sample. In our study, we observed that graph sharpening integration frequently led to average weights and to more variable results, since the sharpening may also remove important information. Moreover, graph sharpening integration tended to have a higher mean AUC score than graph-based semi-supervised learning even though its mean accuracy rate was lower. This indicates that the prediction can be improved by using a non-median cut-off. Bayesian network is very sensitive to noise: inappropriate bin settings and small sample sizes will result in infinite odds ratios. To avoid these situations, one should adjust the bin selection to make sure that each bin contains sufficient samples, and use a simple Bayesian network instead of a structured model when the sample size is small. The performance of composite association network is in general very good and stable. It assigns weights to the different data sources by minimizing the least squares error between the target network and the composite weight matrix, and then predicts via the combined weight matrix. This unique feature makes its training process simpler than that of the other graph-based algorithms. We conclude that the composite association network may be a good choice for integrating different data sources when choosing among graph-based algorithms.
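The weighting step of the composite association network can be sketched as an ordinary least squares fit. The code below is a minimal sketch under several assumptions that are not taken from this paper: the target network is defined as the product of the class labels for each pair of subjects, no intercept term is fitted, data sources with negative weights are dropped and the fit repeated, and equal (average) weights are used when every source is dropped. All function names and toy matrices are hypothetical, and the prediction step on the combined weight matrix is not shown.

```python
def upper(M):
    """Flatten the strictly upper-triangular entries of a square matrix."""
    n = len(M)
    return [M[i][j] for i in range(n) for j in range(i + 1, n)]


def solve(A, b):
    """Solve A x = b by Gaussian elimination (assumes A is non-singular)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            k = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= k * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def least_squares(cols, t):
    """Solve the normal equations (X^T X) mu = X^T t for the given columns."""
    A = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(a * b for a, b in zip(ci, t)) for ci in cols]
    return solve(A, rhs)


def network_weights(networks, y):
    """Fit one weight per data source; fall back to average weights if none survive."""
    target = upper([[yi * yj for yj in y] for yi in y])  # assumed target network
    cols = [upper(W) for W in networks]
    active = list(range(len(networks)))
    while active:
        mu = least_squares([cols[d] for d in active], target)
        if all(m >= 0 for m in mu):
            weights = [0.0] * len(networks)
            for d, m in zip(active, mu):
                weights[d] = m
            return weights
        active = [d for d, m in zip(active, mu) if m >= 0]  # drop negative sources
    return [1.0 / len(networks)] * len(networks)


# Assumed toy data: 4 subjects, labels +1 / -1, three similarity networks.
y = [1, 1, -1, -1]
A = [[0, 0.9, 0.1, 0.1], [0.9, 0, 0.1, 0.1], [0.1, 0.1, 0, 0.9], [0.1, 0.1, 0.9, 0]]
B = [[0, 0.1, 0.8, 0.2], [0.1, 0, 0.3, 0.7], [0.8, 0.3, 0, 0.2], [0.2, 0.7, 0.2, 0]]
C = [[0, 0.1, 0.9, 0.9], [0.1, 0, 0.9, 0.9], [0.9, 0.9, 0, 0.1], [0.9, 0.9, 0.1, 0]]

print(network_weights([A, B], y))  # informative source A keeps a positive weight
print(network_weights([B, C], y))  # both sources are dropped, so weights are averaged
```

The second call shows how average weights can arise: when no data source agrees with the labels, every fitted weight eventually turns negative and the sketch falls back to equal weighting.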

Comparison of different data integration algorithms

| Integration Algorithms | Computation Time | Stability | Characteristics |
|---|---|---|---|
| Graph-based semi-supervised learning | Low | Medium | Tuning parameter; performance can be poor sometimes |
| Graph sharpening integration | Low | Low | Tuning parameter; average weights frequently occur |
| Composite association network | Low | High | Average weights occur when all weights are negative |
| Bayesian network | Low | Low | Bins selection and training sample size affect performance |
| Semi-definite programming SVM | Medium | Low | Two tuning parameters; |
| Relevance vector machine | High | High | Long training time; Probabilistic result |
| Ada-boost relevance vector machine | High | Medium | Resampling size and iteration can be hard to determine |

The rationale for choosing these seven algorithms is that they preserve data-specific properties and can integrate data of different scales. Each data source is transformed into an intermediate form, such as a graph or a kernel matrix. Graph-based integration is a natural way to reveal the relationships among samples and is less computationally intensive. Kernel-based integration is good at detecting nonlinear relationships between samples. There are other categories of integration algorithms, such as concatenation-based integration, which combines multiple data sources into one large input matrix before analysis. The algorithms for this type of integration include LASSO regression and elastic-net regression [24].
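The difference between an intermediate-form integration and a concatenation-based one can be seen on a toy example with two data sources measured on very different scales. The sketch below is illustrative only: the data values, the use of a normalized linear kernel and the unweighted average of the kernels are all assumptions made here, whereas an algorithm such as SDP-SVM learns the kernel weights rather than fixing them.

```python
def linear_kernel(X):
    """Gram matrix of pairwise dot products between samples (rows of X)."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def normalize_kernel(K):
    """Scale a kernel to unit diagonal: K_ij / sqrt(K_ii * K_jj)."""
    d = [K[i][i] ** 0.5 for i in range(len(K))]
    return [[K[i][j] / (d[i] * d[j]) for j in range(len(K))] for i in range(len(K))]


def average_kernels(kernels):
    """Unweighted element-wise mean of several kernel matrices."""
    n = len(kernels[0])
    return [[sum(K[i][j] for K in kernels) / len(kernels) for j in range(n)]
            for i in range(n)]


# Assumed toy data: 3 samples measured in two sources on very different scales.
source_1 = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
source_2 = [[1000.0, 10.0], [10.0, 1000.0], [990.0, 20.0]]

# Concatenation-based: one large input matrix, then a single kernel.
concatenated = [r1 + r2 for r1, r2 in zip(source_1, source_2)]
K_concat = normalize_kernel(linear_kernel(concatenated))

# Intermediate-form: one normalized kernel per source, then combine.
K_combined = average_kernels([normalize_kernel(linear_kernel(source_1)),
                              normalize_kernel(linear_kernel(source_2))])

print("similarity of samples 0 and 2, concatenated:", round(K_concat[0][2], 4))
print("similarity of samples 0 and 2, combined    :", round(K_combined[0][2], 4))
```

In the concatenated matrix the large-scale source dominates, so samples 0 and 2 look almost identical even though the first source treats them as unrelated. Building one kernel per source and combining afterwards lets each source contribute on a comparable scale.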

## Conclusions

## Declarations

### Acknowledgements

The GAW19 exome and whole genome sequence data were provided by the T2D-GENES Consortium. Additional genetic and phenotypic data for GAW19 were provided by the San Antonio Family Heart Study, San Antonio Family Diabetes/Gallbladder Study, and Starr County. We thank three anonymous reviewers for their valuable comments and suggestions.

### Funding

The GAW was supported by National Institutes of Health grant R01 GM031575. The T2D-GENES Consortium was supported by National Institutes of Health grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study were supported by National Institutes of Health grants R01 HL0113323, P01 HL045222, R01 DK047482, and R01 DK053889. The Starr County data were generated with support by National Institutes of Health grants R01 DK073541 and R01 HL102830. H. Zhao was partly supported by National Institutes of Health grants GM059507 and CA154295. This work was also partially supported by Research Grants Council - General Research Fund no. 17157416.

### Availability of data and materials

An R package that includes the three recommended algorithms (composite association network, relevance vector machine, and Ada-boost relevance vector machine) is available at http://web.hku.hk/~herbpang/MDIntegration.html and http://zhaocenter.org/software/.

### Authors’ contributions

KKY implemented the different methods and conducted data analysis. HZ edited and revised the manuscript. HP designed the study, obtained study data, and provided guidance on methodology. All authors drafted the manuscript, read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## References

1. Gonzalez GH, Tahsin T, Goodale BC, Greene AC, Greene CS. Recent advances and emerging applications in text and data mining for biomedical discovery. Brief Bioinform. 2016;17(1):33–42.
2. Taskesen E, Babaei S, Reinders MM, de Ridder J. Integration of gene expression and DNA-methylation profiles improves molecular subtype classification in acute myeloid leukemia. BMC Bioinf. 2015;16(Suppl 4):S5.
3. Ma X, Liu Z, Zhang Z, Huang X, Tang W. Multiple network algorithm for epigenetic modules via the integration of genome-wide DNA methylation and gene expression data. BMC Bioinf. 2017;18(1):72.
4. Costello JC, Heiser LM, Georgii E, Gonen M, Menden MP, Wang NJ, Bansal M, Ammad-ud-din M, Hintsanen P, Khan SA, et al. A community effort to assess and improve drug sensitivity prediction algorithms. Nat Biotechnol. 2014;32(12):1202–12.
5. Tsuda K, Shin H, Schölkopf B. Fast protein classification with multiple networks. Bioinformatics. 2005;21(Suppl 2):ii59–65.
6. Shin H, Lisewski AM, Lichtarge O. Graph sharpening plus graph integration: a synergy that improves protein functional classification. Bioinformatics. 2007;23(23):3217–24.
7. Mostafavi S, Ray D, Warde-Farley D, Grouios C, Morris Q. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol. 2008;9(Suppl 1):S4.
8. Mostafavi S, Morris Q. Fast integration of heterogeneous data sources for predicting gene function with limited annotation. Bioinformatics. 2010;26(14):1759–65.
9. Rhodes DR, Tomlins SA, Varambally S, Mahavisno V, Barrette T, Kalyana-Sundaram S, Ghosh D, Pandey A, Chinnaiyan AM. Probabilistic model of the human protein-protein interaction network. Nat Biotechnol. 2005;23(8):951–9.
10. Lanckriet GR, De Bie T, Cristianini N, Jordan MI, Noble WS. A statistical framework for genomic data fusion. Bioinformatics. 2004;20(16):2626–35.
11. Lanckriet GRG, Cristianini N, Bartlett P, El Ghaoui L, Jordan MI. Learning the kernel matrix with semidefinite programming. J Mach Learn Res. 2004;5:27–72.
12. Tipping ME. Sparse Bayesian learning and the relevance vector machine. J Mach Learn Res. 2001;1(3):211–44.
13. Tipping ME, Faul AC. Fast marginal likelihood maximisation for sparse Bayesian models. In: AISTATS; 2003.
14. Wu CC, Asgharzadeh S, Triche TJ, D’Argenio DZ. Prediction of human functional genetic networks from heterogeneous data using RVM-based ensemble learning. Bioinformatics. 2010;26(6):807–13.
15. Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B. Learning with local and global consistency. Adv Neural Inf Proces Syst. 2004;16(16):321–8.
16. Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST, et al. Complement factor H polymorphism in age-related macular degeneration. Science. 2005;308(5720):385–9.
17. Smola AJ, Schölkopf B. Learning with kernels. GMD-Forschungszentrum Informationstechnik; 1998.
18. Nemirovski A. Interior point polynomial time methods in convex programming. Lecture notes; 2004.
19. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
20. Chockalingam A. Impact of world hypertension day. Can J Cardiol. 2007;23(7):517–9.
21. Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490(7418):61–70.
22. Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature. 2011;474(7353):609–15.
23. Xu GG, Huang JHZ. Asymptotic optimality and efficient computation of the leave-subject-out cross-validation. Ann Stat. 2012;40(6):3003–30.
24. Liu Q, Zhang B. Integrative omics analysis reveals post-transcriptionally enhanced protective host response in colorectal cancers with microsatellite instability. J Proteome Res. 2016;15(3):766–76.