 Methodology Article
 Open access
Predicting breast cancer using an expression values weighted clinical classifier
BMC Bioinformatics volume 15, Article number: 411 (2014)
Abstract
Background
Clinical data, such as patient history, laboratory analysis and ultrasound parameters, which are the basis of day-to-day clinical decision support, are often used to guide the clinical management of cancer in the presence of microarray data. Several data fusion techniques are available to integrate genomics or proteomics data, but only a few studies have created a single prediction model using both gene expression and clinical data. These studies often remain inconclusive regarding an obtained improvement in prediction performance. To improve clinical management, these data should be fully exploited. This requires efficient algorithms to integrate these data sets and design a final classifier.
LSSVM classifiers and generalized eigenvalue/singular value decompositions are successfully used in many bioinformatics applications for prediction tasks. Combining the benefits of these two techniques, we propose a machine learning approach, a weighted LSSVM classifier, to integrate two data sources: microarray data and clinical parameters.
Results
We compared and evaluated the proposed methods on five breast cancer case studies. Compared to an LSSVM classifier on individual data sets, generalized eigenvalue decomposition (GEVD) and kernel GEVD, the proposed weighted LSSVM classifier offers good prediction performance, in terms of test area under the ROC curve (AUC), on all breast cancer case studies.
Conclusions
Thus, a clinical classifier weighted with a microarray data set results in significantly improved diagnosis, prognosis and prediction of responses to therapy. The proposed model is shown to be a promising mathematical framework for both data fusion and nonlinear classification problems.
Background
Microarray technology, which can handle thousands of genes of several hundreds of patients at a time, makes it hard for scientists to manually extract relevant information about genes and diseases, especially cancer. Moreover, this technique suffers from a low signal-to-noise ratio. Despite the rise of high-throughput technologies, clinical data such as age, gender and medical history guide clinical management for most diseases and examinations. A recent study [1] shows that the integration of microarray and clinical data has a synergetic effect on predicting breast cancer outcome. Gevaert et al. [2] used a Bayesian framework to combine expression and clinical data. They found that decision integration and partial integration lead to better performance, whereas full data integration showed no improvement. These results were obtained by using a cross-validation approach on the 78 samples in the van’t Veer et al. [3] data set. On the same data set, Boulesteix et al. [4] employed random forests and partial least squares approaches to combine expression and clinical data. In contrast, they reported that microarray data do not noticeably improve the prediction accuracy yielded by clinical parameters alone.
The representation of any data set with a real-valued kernel matrix, independent of the nature or complexity of the data to be analyzed, makes kernel methods ideally positioned for heterogeneous data integration [5]. Integration of data using kernel fusion has several advantages. Biological data have diverse structures, for example high-dimensional expression data, sequence data, annotation data, text mining data and heterogeneous clinical data. The main advantage is that this heterogeneity is resolved by the kernel trick: data with diverse structures are all transformed into kernel matrices of the same size. To integrate them, one could follow the classical additive expansion strategy of machine learning and combine them linearly [6]. Nonlinear integration methods of kernels have also attracted great interest in recent machine learning research.
Daemen et al. [7] proposed kernel functions for clinical parameters and pursued an integration approach based on combining kernels (kernel inner product matrices derived from the separate data types) for application in a Least Squares Support Vector Machine (LSSVM). They explained that the newly proposed kernel functions for clinical parameters do not suffer from the ambiguity of data preprocessing, because all variables are considered equally. That is, a distinction is made between continuous variables, ordinal variables with an intrinsic ordering but often lacking equal distance between two consecutive categories, and nominal variables without any ordering. They concluded that the clinical kernel functions represent similarities between patients more accurately than linear or polynomial kernel functions for modeling clinical data. Pittman et al. [8] combined clinical and expression data for predicting breast cancer outcome by means of a tree classifier. This tree classifier was trained using metagenes and/or clinical data as inputs. They explained that key metagenes can, up to a degree, replace traditional risk factors in terms of individual association with recurrences. However, the combination of metagenes and clinical factors currently defines the models most relevant in terms of statistical fit and also, more practically, in terms of cross-validation predictive accuracy. The resulting tree models provide an integrated clinico-genomic analysis that generates substantially accurate and cross-validated predictions at the individual patient level.
Singular Value Decomposition (SVD) and generalized SVD (GSVD) have been shown to have great potential within bioinformatics for extracting common information from data sets such as genomics and proteomics data [9],[10]. Several studies have used LSSVM as a prediction tool, especially in microarray analysis [11],[12].
In this paper, we propose a machine learning approach for data integration: a weighted LSSVM classifier. Initially we will explain generalized eigenvalue decomposition (GEVD) and kernel GEVD. Later we will explore the relationships of kernel GEVD with weighted LSSVM classifier. Finally, the advantages of this new classifier will be demonstrated on five breast cancer case studies, for which expression data and an extensive collection of clinical data are publicly available.
Data sets
Breast cancer is one of the most extensively studied cancer types for which many microarray data sets are publicly available. Among them, we selected five cases for which a sufficient number of clinical parameters were available [3],[13]-[16]. All the data sets that we have used are available in the Integrated Tumor Transcriptome Array and Clinical data Analysis database (ITTACA). An overview of all the data sets is given in Table 1.
Microarray data
For the first three data sets, the microarray data were obtained with Affymetrix technology and preprocessed with MAS5.0, the GeneChip Microarray Analysis Suite 5.0 software (Affymetrix). However, as probe selection for the Affymetrix gene chips relied on earlier genome and transcriptome annotations that differ significantly from current knowledge, an updated array annotation was used for the conversion of probes to Entrez Gene IDs, lowering the number of false positives [17].
The fourth data set consists of two groups of patients [3]. The first group, the training set, consists of 78 patients, of whom 34 belonged to the poor prognosis group and 44 to the good prognosis group. The second group, the test set, consists of 19 patients, of whom 12 belonged to the poor prognosis group and 7 to the good prognosis group. The microarray data were already background corrected, normalized and log-transformed. A preprocessing step removed genes with small profile variance, below the 10th percentile.
The last data set consists of transcript profiles of 251 primary breast tumors, assessed by using Affymetrix U133 oligonucleotide microarrays. cDNA sequence analysis revealed that 58 of these tumors had p53 mutations resulting in protein-level changes, whereas the remaining 193 tumors were p53 wild type (wt) [16].
Clinical data
The first data set, of 129 patients, contained information on 17 clinical variables, of which 5 were excluded [13]: two redundant variables that were least informative based on univariate analysis in those variable pairs with a correlation coefficient exceeding 0.7, and three variables with too many missing values. After exclusion of patients with missing clinical information, this data set consisted of 110 patients: in 85 of them the disease did not recur, whilst in 25 patients the disease recurred.
The second data set, in which response to treatment was studied, consisted of 12 variables for 133 patients [14]. Patient and variable exclusion was applied as described above. Of the 129 remaining patients, 33 showed complete response to treatment while 96 patients were characterized as having residual disease.
In the third data set, relapse was studied in 187 patients [15]. After preprocessing, this data set retained information on 5 variables for 177 patients. In 112 patients no relapse occurred, while 65 patients were characterized as having a relapse.
The fourth data set [3] consisted of the same predefined training and test sets as the corresponding microarray data. The last data set consisted of 251 patients with 6 available clinical variables [16]. After exclusion of patients with missing clinical information, this data set consisted of 237 patients, of whom 55 had a p53 mutant breast tumor and the remaining patients did not.
Methods
In the first section, we discuss GEVD and represent it in terms of an ordinary EVD. Then an overview of the LSSVM formulation to kernel PCA and of least squares support vector machine (LSSVM) classifiers is given. Next, we formulate an optimization problem for kernel GEVD in primal space and derive its solution in dual space. Finally, by generalizing this optimization problem in terms of an LSSVM classifier, we propose a new machine learning approach for data fusion and classification, a weighted LSSVM classifier.
Generalized Eigenvalue decomposition
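The Generalized Singular Value Decomposition (GSVD) of an $m\times N$ matrix $A$ and a $p\times N$ matrix $B$ is [18]

$$A = U\,\Sigma_{A}\,X^{T} \tag{1}$$

$$B = V\,\Sigma_{B}\,X^{T} \tag{2}$$

where $U$, $V$ are orthogonal matrices and the columns of $X$ are generalized singular vectors.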
If $B^{T}B$ is invertible, then the GEVD of $A^{T}A$ and $B^{T}B$ can be obtained from Equations (1) and (2) as follows:

$$A^{T}A\,(X^{T})^{-1} = B^{T}B\,(X^{T})^{-1}\,\Lambda \tag{3}$$

where $\Lambda$ is a diagonal matrix with diagonal entries $\Lambda_{ii}=\left(\Sigma_{A_{ii}}/\Sigma_{B_{ii}}\right)^{2}$, $i=1,\ldots,N$.
Equation (3) can be represented as an eigenvalue decomposition (EVD) as follows:

$$(B^{T}B)^{-1/2}\,A^{T}A\,(B^{T}B)^{-1/2}\,U = U\,\Lambda$$
where U=(B ^{T}B)^{1/2}(X ^{T})^{−1}. The SVD of matrix A(B ^{T}B)^{−1/2} is given below:
The matrix $(B^{T}B)^{-1/2}$ is defined [19] as follows. Let the EVD of $B^{T}B$ be $B^{T}B=T\,\Sigma\,T^{T}$, where the columns of $T$ are eigenvectors and $\Sigma$ is a diagonal matrix. Then $(B^{T}B)^{1/2}=T\,\Sigma^{1/2}\,T^{T}$ and $(B^{T}B)^{-1/2}=T\,Q\,T^{T}$, where $Q$ is a diagonal matrix with diagonal entries $Q_{ii}=(\Sigma_{ii})^{-1/2}$, $i=1,\ldots,N$.
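The reduction of the GEVD to an ordinary symmetric EVD can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the helper name `inv_sqrt_psd`, the matrix sizes and the random test matrices are assumptions made for illustration only.

```python
import numpy as np

def inv_sqrt_psd(M):
    """Return M^(-1/2) for a symmetric positive definite M via its EVD."""
    eigvals, T = np.linalg.eigh(M)              # M = T diag(eigvals) T^T
    return T @ np.diag(eigvals ** -0.5) @ T.T   # T Q T^T with Q_ii = eigvals_i^(-1/2)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))   # hypothetical m x N matrix
B = rng.standard_normal((8, 4))   # hypothetical p x N matrix; B^T B is invertible here

BtB_inv_sqrt = inv_sqrt_psd(B.T @ B)

# Ordinary EVD of the symmetric matrix (B^T B)^(-1/2) A^T A (B^T B)^(-1/2)
S = BtB_inv_sqrt @ (A.T @ A) @ BtB_inv_sqrt
lam, U = np.linalg.eigh(S)

# Map back to generalized eigenvectors G of the pair (A^T A, B^T B)
G = BtB_inv_sqrt @ U
residual = (A.T @ A) @ G - (B.T @ B) @ G @ np.diag(lam)
print("largest GEVD residual:", np.abs(residual).max())
```

The residual measures how well the back-transformed vectors satisfy the generalized eigenvalue equation; it should be at the level of floating-point round-off.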
LSSVM formulation to Kernel PCA
An LSSVM approach to kernel PCA was introduced in [20]. This approach showed that kernel PCA is the dual solution to a primal optimization problem formulated in a kernel-induced feature space. Given a training set $\{x_{i}\}_{i=1}^{N}$, $x_{i}\in\mathbb{R}^{d}$, the LSSVM approach to kernel PCA is formulated in the primal as:

$$\max_{w,e,b}\; J(w,e)=\frac{\gamma}{2}\sum_{i=1}^{N}e_{i}^{2}-\frac{1}{2}w^{T}w$$

such that $e_{i}=w^{T}\varphi(x_{i})+b$, $i=1,\ldots,N$, where $\gamma$ is a positive regularization constant, $b$ is a bias term and $\varphi(\cdot):\mathbb{R}^{d}\to\mathbb{R}^{d_{h}}$ is the feature map which maps the $d$-dimensional input vector $x$ from the input space to the $d_{h}$-dimensional feature space.

Kernel PCA in dual space takes the form:

$$\Omega_{c}\,\alpha=\lambda\,\alpha$$

where $\alpha$ is an eigenvector, $\lambda$ an eigenvalue and $\Omega_{c}$ denotes the centered kernel matrix with $ij$-th entry

$$\Omega_{c,ij}=K(x_{i},x_{j})-\frac{1}{N}\sum_{r=1}^{N}K(x_{i},x_{r})-\frac{1}{N}\sum_{r=1}^{N}K(x_{j},x_{r})+\frac{1}{N^{2}}\sum_{r=1}^{N}\sum_{s=1}^{N}K(x_{r},x_{s}),$$

with $K(x_{i},x_{j})=\varphi(x_{i})^{T}\varphi(x_{j})$ a positive definite kernel function.
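The centering of the kernel matrix is equivalent to multiplying it on both sides by the matrix $I-\frac{1}{N}1_{N}1_{N}^{T}$. A minimal NumPy sketch of this step follows; the function name `center_kernel`, the toy data and the use of a linear kernel are assumptions chosen so that the result can be compared with explicitly centered data.

```python
import numpy as np

def center_kernel(K):
    """Center a square kernel matrix in feature space: M K M with M = I - (1/N) 1 1^T."""
    N = K.shape[0]
    M = np.eye(N) - np.ones((N, N)) / N
    return M @ K @ M

# Toy data: 4 samples with 2 features (illustrative values only)
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]])
K = X @ X.T                       # linear kernel, used here only for illustration
Omega_c = center_kernel(K)

# Kernel PCA in the dual: eigenvectors of the centered kernel matrix
lam, alpha = np.linalg.eigh(Omega_c)

# For a linear kernel, centering the kernel equals centering the data first
X_centered = X - X.mean(axis=0)
print(np.allclose(Omega_c, X_centered @ X_centered.T))
```

With a nonlinear kernel such as the RBF kernel the feature map is implicit, so only the kernel-level centering is available.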
Least squares support vector machine classifiers
A kernel algorithm for supervised classification is the SVM developed by Vapnik [21] and others. Contrary to most other classification methods, and due to the way data are represented through kernels, SVMs can tackle high-dimensional data (for example microarray data). Given a training set $\{x_{i},y_{i}\}_{i=1}^{N}$ with input data $x_{i}\in\mathbb{R}^{d}$ and corresponding binary class labels $y_{i}\in\{-1,+1\}$, the SVM forms a linear discriminant boundary $y(x)=\operatorname{sign}[w^{T}\varphi(x)+b]$ in the feature space with maximum distance between samples of the two considered classes, with $w$ representing the weights for the data items in the feature space, $b$ the bias term and $\varphi(\cdot):\mathbb{R}^{d}\to\mathbb{R}^{n_{1}}$ the feature map which maps the $d$-dimensional input vector $x$ from the input space to the $n_{1}$-dimensional feature space. This corresponds to a nonlinear discriminant function in the original input space. Vapnik’s SVM classifier formulation was modified in [22]. This modified version is much faster for classification because a linear system instead of a quadratic programming problem needs to be solved.
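The constrained optimization problem for the least squares support vector machine (LSSVM) [22],[23] for classification is defined as follows:

$$\min_{w,b,e}\; J(w,e)=\frac{1}{2}w^{T}w+\frac{\gamma}{2}\sum_{i=1}^{N}e_{i}^{2}$$

subject to:

$$y_{i}\left[w^{T}\varphi(x_{i})+b\right]=1-e_{i},\quad i=1,\ldots,N,$$

with $e_{i}$ the error variables, tolerating misclassifications in cases of overlapping distributions, and $\gamma$ the regularization parameter, which allows tackling the problem of overfitting. The LSSVM classifier formulation implicitly corresponds to a regression interpretation with binary targets $y_{i}=\pm 1$.

In the dual space the solution is given by

$$\begin{bmatrix}0 & y^{T}\\ y & \Omega+I_{N}/\gamma\end{bmatrix}\begin{bmatrix}b\\ \beta\end{bmatrix}=\begin{bmatrix}0\\ 1_{N}\end{bmatrix}$$

with $y=[y_{1},\ldots,y_{N}]^{T}$, $1_{N}=[1,\ldots,1]^{T}$, $e=[e_{1},\ldots,e_{N}]^{T}$, $\beta=[\beta_{1},\ldots,\beta_{N}]^{T}$ and $\Omega_{ij}=y_{i}y_{j}K(x_{i},x_{j})$, where $K(x_{i},x_{j})$ is the kernel function.

The classifier in the dual space takes the form

$$y(x)=\operatorname{sign}\left[\sum_{i=1}^{N}\beta_{i}\,y_{i}\,K(x,x_{i})+b\right]$$

where $\beta_{i}$ are Lagrange multipliers.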
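Training an LSSVM classifier therefore amounts to solving one linear system. The sketch below illustrates this with NumPy, assuming an RBF kernel and a small hand-made data set; the function names `lssvm_train` and `lssvm_predict`, the toy points and the values of `gamma` and `sigma` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    """RBF kernel matrix between the rows of X and the rows of Z."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def lssvm_train(X, y, gamma, sigma):
    """Solve the LSSVM dual linear system for the bias b and multipliers beta."""
    N = len(y)
    Omega = np.outer(y, y) * rbf_kernel(X, X, sigma)   # Omega_ij = y_i y_j K(x_i, x_j)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]

def lssvm_predict(X_train, y, b, beta, X_new, sigma):
    """Class labels sign(sum_i beta_i y_i K(x, x_i) + b) for the rows of X_new."""
    K = rbf_kernel(X_new, X_train, sigma)
    return np.sign(K @ (beta * y) + b)

# Two well separated toy clusters (illustrative data only)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
sigma, gamma = 1.0, 10.0

b, beta = lssvm_train(X, y, gamma, sigma)
pred_train = lssvm_predict(X, y, b, beta, X, sigma)
pred_new = lssvm_predict(X, y, b, beta, np.array([[0.5, 0.5], [3.5, 3.5]]), sigma)
print(pred_train, pred_new)
```

Because only a linear solve is involved, the cost is dominated by the size of the kernel matrix, which depends on the number of samples and not on the number of genes.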
LSSVM and kernel GEVD
LSSVM formulations to different problems were discussed in [23]. This class of kernel machines emphasizes primaldual interpretations in the context of constrained optimization problems. In this section we discuss LSSVM formulations to kernel GEVD, which is a nonlinear GEVD of m×N matrix A, and p×N matrix B, and a weighted LSSVM classifier.
Consider a training data set of $N$ points $\mathcal{D}=\{x_{i}^{(1)},x_{i}^{(2)},y_{i}\}_{i=1}^{N}$ with output data $y_{i}\in\mathbb{R}$ and input data $x_{i}^{(1)}\in\mathbb{R}^{m}$, $x_{i}^{(2)}\in\mathbb{R}^{p}$, where $x_{i}^{(1)}$ and $x_{i}^{(2)}$ are the $i$-th samples of the matrices $A$ and $B$ respectively.
Consider the feature maps $\varphi^{(1)}(\cdot):\mathbb{R}^{m}\to\mathbb{R}^{n_{1}}$ and $\varphi^{(2)}(\cdot):\mathbb{R}^{p}\to\mathbb{R}^{n_{2}}$ to high-dimensional feature spaces, which are possibly infinite dimensional. The centered feature matrices $\Phi_{c}^{(1)}\in\mathbb{R}^{n_{1}\times N}$, $\Phi_{c}^{(2)}\in\mathbb{R}^{n_{2}\times N}$ become

$$\Phi_{c}^{(l)}=\left[\varphi^{(l)}\!\left(x_{1}^{(l)}\right)-\hat{\mu}_{\varphi_{l}},\;\ldots,\;\varphi^{(l)}\!\left(x_{N}^{(l)}\right)-\hat{\mu}_{\varphi_{l}}\right],\quad l=1,2,$$

where $\hat{\mu}_{\varphi_{l}}=\frac{1}{N}\sum_{i=1}^{N}\varphi^{(l)}\!\left(x_{i}^{(l)}\right)$, $l=1,2$.
LSSVM approach to Kernel GEVD
Kernel GEVD is a nonlinear extension of GEVD, in which data are first embedded into a high-dimensional feature space induced by a kernel and then linear GEVD is applied. Considering the matrix $A(B^{T}B)^{-1/2}$ in Equation (4) and the feature maps $\varphi^{(1)}(\cdot):\mathbb{R}^{m}\to\mathbb{R}^{n_{1}}$ and $\varphi^{(2)}(\cdot):\mathbb{R}^{p}\to\mathbb{R}^{n_{2}}$ described in the previous section, the covariance matrix of $A(B^{T}B)^{-1/2}$ in the feature space becomes $C\approx\Phi_{c}^{(1)}\left({\Phi_{c}^{(2)}}^{T}\Phi_{c}^{(2)}\right)^{-1}{\Phi_{c}^{(1)}}^{T}$ with eigendecomposition $Cv=\lambda v$.
Building on the kernel PCA formulation in the LSSVM framework [24], discussed in section ‘LSSVM formulation to Kernel PCA’, and on the EVD $Cv=\lambda v$ in primal space, our objective is to find the directions in which the projected variables have maximal variance.
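The LSSVM approach to kernel GEVD is formulated as follows:

$$\max_{v,e}\; J(v,e)=\frac{\gamma}{2}\,e^{T}\left({\Phi_{c}^{(2)}}^{T}\Phi_{c}^{(2)}\right)^{-1}e-\frac{1}{2}v^{T}v\quad\text{such that}\quad e={\Phi_{c}^{(1)}}^{T}v \tag{5}$$

where $v$ is the eigenvector in the primal space, $\gamma\in\mathbb{R}^{+}$ is a regularization constant and $e$ are the projected data points in the target space.

Defining the Lagrangian

$$\mathcal{L}(v,e;\alpha)=\frac{\gamma}{2}\,e^{T}\left({\Phi_{c}^{(2)}}^{T}\Phi_{c}^{(2)}\right)^{-1}e-\frac{1}{2}v^{T}v-\alpha^{T}\left(e-{\Phi_{c}^{(1)}}^{T}v\right)$$

with optimality conditions

$$\frac{\partial\mathcal{L}}{\partial v}=0\;\Rightarrow\;v=\Phi_{c}^{(1)}\alpha,\qquad\frac{\partial\mathcal{L}}{\partial e}=0\;\Rightarrow\;\alpha=\gamma\left({\Phi_{c}^{(2)}}^{T}\Phi_{c}^{(2)}\right)^{-1}e,\qquad\frac{\partial\mathcal{L}}{\partial\alpha}=0\;\Rightarrow\;e={\Phi_{c}^{(1)}}^{T}v,$$

elimination of $v$ and $e$ yields an equation in the form of a GEVD

$$\Omega_{c}^{(1)}\alpha=\lambda\,\Omega_{c}^{(2)}\alpha$$

where $\lambda=\frac{1}{\gamma}$ is the largest eigenvalue, $\Omega_{c}^{(1)}={\Phi_{c}^{(1)}}^{T}\Phi_{c}^{(1)}$ and $\Omega_{c}^{(2)}={\Phi_{c}^{(2)}}^{T}\Phi_{c}^{(2)}$ are centered kernel matrices and $\alpha$ are generalized eigenvectors. The symmetric kernel matrices $\Omega_{c}^{(1)}$ and $\Omega_{c}^{(2)}$ resolve the heterogeneities of clinical and microarray data by the use of the kernel trick, where data which have diverse structures are transformed into kernel matrices of the same size.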
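A numerical sketch of the kernel GEVD in the dual is given below, using NumPy only. It is an illustration under stated assumptions rather than the authors' code: the function names, the random toy data and the bandwidths are hypothetical, and the small `ridge` term is an assumption of this sketch, added because a centered kernel matrix is singular and the weighting matrix must be invertible.

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    """RBF kernel matrix between the rows of X and the rows of Z."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def center_kernel(K):
    """Center a square kernel matrix in feature space."""
    N = K.shape[0]
    M = np.eye(N) - np.ones((N, N)) / N
    return M @ K @ M

def kernel_gevd(Omega1, Omega2, ridge=1e-3):
    """Solve Omega1 a = lam (Omega2 + ridge*I) a, largest eigenvalue first.

    The ridge is an assumption of this sketch: it regularizes the singular
    centered kernel matrix before its inverse square root is taken.
    """
    N = Omega2.shape[0]
    w, T = np.linalg.eigh(Omega2 + ridge * np.eye(N))
    W = T @ np.diag(w ** -0.5) @ T.T            # (Omega2 + ridge*I)^(-1/2)
    lam, U = np.linalg.eigh(W @ Omega1 @ W)     # ordinary symmetric EVD
    order = np.argsort(lam)[::-1]
    return lam[order], (W @ U)[:, order]

rng = np.random.default_rng(1)
X_clin = rng.standard_normal((10, 3))    # toy "clinical" data for 10 patients
X_expr = rng.standard_normal((10, 20))   # toy "expression" data, same patients

Omega1 = center_kernel(rbf_kernel(X_clin, X_clin, sigma=1.5))
Omega2 = center_kernel(rbf_kernel(X_expr, X_expr, sigma=4.0))
lam, alpha = kernel_gevd(Omega1, Omega2)

# Scores: clinical kernel projected on the two leading generalized eigenvectors
scores = Omega1 @ alpha[:, :2]
print(scores.shape)
```

The scores computed in the last step play the role of the projected variables on which a downstream classifier can be trained.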
In a special case of GEVD, if one of the data matrices is the identity matrix, the GEVD is equivalent to an ordinary EVD. If $\left({\Phi_{c}^{(2)}}^{T}\Phi_{c}^{(2)}\right)^{-1}=I$, then the optimization problem proposed for kernel GEVD (see Equation (5)) is equivalent to the optimization problem in [20] for the LSSVM approach to kernel PCA.
Weighted LSSVM classifier
Our objective is to represent kernel GEVD in the form of a weighted LSSVM classifier. Given the link between the LSSVM approach to kernel GEVD in Equation (5) and the weighted LSSVM classifier (see [25] for a different type of weighting, used to achieve robustness), one considers the following optimization problem in primal weight space:

$$\min_{v,e,b}\; J(v,e)=\frac{1}{2}v^{T}v+\frac{\gamma}{2}\,e^{T}\left({\Phi_{c}^{(2)}}^{T}\Phi_{c}^{(2)}\right)^{-1}e\quad\text{such that}\quad y={\Phi_{c}^{(1)}}^{T}v+b\,1_{N}+e$$

with $e=[e_{1},\ldots,e_{N}]^{T}$ a vector of variables to tolerate misclassifications, weight vector $v$ in primal weight space, bias term $b$ and regularization parameter $\gamma\in\mathbb{R}^{+}$. Compared to the constrained optimization problem for the least squares support vector machine (LSSVM) [22],[23], in this case the error variables are weighted with the matrix $\left({\Phi_{c}^{(2)}}^{T}\Phi_{c}^{(2)}\right)^{-1/2}$.
The weight vector $v$ can be infinite dimensional, which makes the calculation of $v$ impossible in general. One defines the Lagrangian

$$\mathcal{L}(v,e,b;\alpha)=\frac{1}{2}v^{T}v+\frac{\gamma}{2}\,e^{T}\left({\Phi_{c}^{(2)}}^{T}\Phi_{c}^{(2)}\right)^{-1}e-\alpha^{T}\left\{\left({\Phi_{c}^{(1)}}^{T}v+b\,1_{N}\right)+e-y\right\},$$

with Lagrange multipliers $\alpha\in\mathbb{R}^{N}$.

Elimination of $v$ and $e$ yields a linear system

$$\begin{bmatrix}0 & 1_{N}^{T}\\ 1_{N} & \Omega_{c}^{(1)}+\gamma^{-1}\Omega_{c}^{(2)}\end{bmatrix}\begin{bmatrix}b\\ \alpha\end{bmatrix}=\begin{bmatrix}0\\ y\end{bmatrix} \tag{6}$$

with $y=[y_{1},\ldots,y_{N}]^{T}$, $1_{N}=[1,\ldots,1]^{T}$, $\alpha=[\alpha_{1},\ldots,\alpha_{N}]^{T}$, $\Omega_{c}^{(1)}={\Phi_{c}^{(1)}}^{T}\Phi_{c}^{(1)}$ and $\Omega_{c}^{(2)}={\Phi_{c}^{(2)}}^{T}\Phi_{c}^{(2)}$.
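The elimination step can be spelled out from the stationarity conditions of the Lagrangian. The following is a sketch derived from the definitions in this section, not a reproduction of the authors' own derivation:

```latex
\begin{aligned}
\frac{\partial\mathcal{L}}{\partial v}=0 &\;\Rightarrow\; v=\Phi_c^{(1)}\alpha,\\
\frac{\partial\mathcal{L}}{\partial e}=0 &\;\Rightarrow\; e=\tfrac{1}{\gamma}\,\Omega_c^{(2)}\alpha,\\
\frac{\partial\mathcal{L}}{\partial b}=0 &\;\Rightarrow\; 1_N^{T}\alpha=0,\\
\frac{\partial\mathcal{L}}{\partial \alpha}=0 &\;\Rightarrow\;
y={\Phi_c^{(1)}}^{T}v+b\,1_N+e
 =\left(\Omega_c^{(1)}+\tfrac{1}{\gamma}\,\Omega_c^{(2)}\right)\alpha+b\,1_N .
\end{aligned}
```

The third and fourth conditions together form a bordered linear system in the unknowns $b$ and $\alpha$; the inverse of $\Omega_{c}^{(2)}$ no longer appears once $e$ has been eliminated.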
The resulting classifier in the dual space is given by

$$y(x)=\sum_{i=1}^{N}\alpha_{i}\left[K^{(1)}(x,x_{i})+\frac{1}{\gamma}K^{(2)}(x,x_{i})\right]+b \tag{7}$$

where $\alpha_{i}$ are the Lagrange multipliers, $\gamma$ is a regularization parameter chosen by the user, $K^{(1)}(x,z)=\varphi^{(1)}(x)^{T}\varphi^{(1)}(z)$, $K^{(2)}(x,z)=\varphi^{(2)}(x)^{T}\varphi^{(2)}(z)$ and $y(x)$ is the output corresponding to the validation point $x$. The LSSVM for nonlinear function estimation in [25] is similar to the proposed weighted LSSVM classifier.
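As an illustration, the weighted LSSVM can be trained by solving a bordered linear system whose kernel block is the clinical kernel matrix plus the microarray kernel matrix divided by γ. The NumPy sketch below makes several assumptions that are not in the original text: the function names, the random toy data, the parameter values, and the use of uncentered RBF kernel matrices for brevity.

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    """RBF kernel matrix between the rows of X and the rows of Z."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def train_weighted_lssvm(K1, K2, y, gamma):
    """Solve the bordered linear system for the bias b and multipliers alpha."""
    N = len(y)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K1 + K2 / gamma          # clinical kernel plus weighted microarray kernel
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]

def decision_values(K1_new, K2_new, b, alpha, gamma):
    """Latent outputs; rows index new samples, columns index training samples."""
    return (K1_new + K2_new / gamma) @ alpha + b

rng = np.random.default_rng(0)
y = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0])
X_clin = rng.standard_normal((8, 2)) + y[:, None]   # toy clinical data with class signal
X_expr = rng.standard_normal((8, 10))               # toy expression data

gamma, sigma1, sigma2 = 1.0, 1.0, 3.0               # illustrative parameter values
K1 = rbf_kernel(X_clin, X_clin, sigma1)
K2 = rbf_kernel(X_expr, X_expr, sigma2)
b, alpha = train_weighted_lssvm(K1, K2, y, gamma)

fitted = decision_values(K1, K2, b, alpha, gamma)   # outputs on the training samples
print(np.round(fitted, 6))
```

For a new patient, `K1_new` and `K2_new` would hold the kernel evaluations between that patient and all training patients, computed with the same bandwidths as in training.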
The symmetric kernel matrices $K^{(1)}$ and $K^{(2)}$ resolve the heterogeneities of the clinical and microarray data sources such that they can be merged additively into a single kernel. The optimization algorithm for the weighted LSSVM classifier is given below:
Algorithm: optimization algorithm for the weighted LSSVM classifier

1. Given a training data set of $N$ points $\mathcal{D}=\{x_{i}^{(1)},x_{i}^{(2)},y_{i}\}_{i=1}^{N}$ with output data $y_{i}\in\mathbb{R}$ and input data $x_{i}^{(1)}\in\mathbb{R}^{m}$, $x_{i}^{(2)}\in\mathbb{R}^{p}$.

2. Calculate the leave-one-out cross-validation (LOOCV) performance on the training set for different combinations of γ and σ _{1}, σ _{2} (the bandwidths of the kernel functions K ^{(1)}, K ^{(2)}) by solving the linear system in Equation (6) and evaluating Equation (7). In case the leave-one-out (LOO) approach is computationally expensive, one could replace it with a leave-p-group-out strategy (p-fold cross-validation).

3. Obtain the optimal parameter combination (γ, σ _{1}, σ _{2}) with the highest LOOCV performance.
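The three steps above can be sketched as a plain grid search. This is a minimal NumPy illustration under assumptions that are not part of the original text: the candidate grids for γ, σ _{1} and σ _{2}, the random toy data, the helper names and the use of uncentered RBF kernels are all hypothetical, and AUC is used as the LOOCV performance measure.

```python
import itertools
import numpy as np

def rbf_kernel(X, Z, sigma):
    """RBF kernel matrix between the rows of X and the rows of Z."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos, neg = scores[labels == 1], scores[labels == -1]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def loo_outputs(X1, X2, y, gamma, sigma1, sigma2):
    """Leave-one-out outputs of the weighted LSSVM for one parameter combination."""
    N = len(y)
    K = rbf_kernel(X1, X1, sigma1) + rbf_kernel(X2, X2, sigma2) / gamma
    out = np.empty(N)
    for i in range(N):
        tr = np.delete(np.arange(N), i)          # indices of the N-1 training samples
        A = np.zeros((N, N))                     # bordered system of size (N-1)+1
        A[0, 1:] = 1.0
        A[1:, 0] = 1.0
        A[1:, 1:] = K[np.ix_(tr, tr)]
        sol = np.linalg.solve(A, np.concatenate(([0.0], y[tr])))
        out[i] = K[i, tr] @ sol[1:] + sol[0]     # output for the left-out sample
    return out

rng = np.random.default_rng(0)
y = np.repeat([1.0, -1.0], 10)
X_clin = rng.standard_normal((20, 3))
X_clin[:, 0] += y                                # toy clinical data with class signal
X_expr = rng.standard_normal((20, 15))           # toy expression data

# Hypothetical candidate values for (gamma, sigma1, sigma2)
grid = list(itertools.product([0.1, 1.0, 10.0], [0.5, 1.0, 2.0], [2.0, 4.0, 8.0]))
results = {p: auc(loo_outputs(X_clin, X_expr, y, *p), y) for p in grid}
best_params = max(results, key=results.get)
best_auc = results[best_params]
print(best_params, best_auc)
```

Each parameter combination requires N linear solves, which is why a p-fold strategy may be preferable for larger data sets.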
The proposed optimization problem is similar to the weighted LSSVM formulation in [24], which replaced $\left({\Phi_{c}^{(2)}}^{T}\Phi_{c}^{(2)}\right)^{-1}$ with a diagonal matrix to achieve sparseness and robustness.
The proposed method is a new machine learning approach to data fusion and subsequent classification. In this study, the advantages of a weighted LSSVM classifier were explored by designing a clinical classifier. This clinical classifier combines kernels by weighting the kernel inner product from one data set with that from the other data set. Here we considered microarray kernels as the weighting matrix for clinical kernels. In each of the case studies, we compared the prediction performance of individual data sets with GEVD, kernel GEVD and the weighted LSSVM classifier. In kernel GEVD, σ _{1} and σ _{2} are the bandwidths of the RBF kernel function $K(x,z)=\exp\left(-\frac{\|x-z\|^{2}}{2\sigma^{2}}\right)$ for the clinical and microarray data sets respectively. These parameters were chosen as the pair (σ _{1}, σ _{2}) that obtained the highest LOOCV performance. The parameter selection (see Algorithm) for the weighted LSSVM classifier is illustrated in Figure 1. For several possible values of the kernel parameters σ _{1} and σ _{2}, the LOO cross-validation performance is computed for each possible value of γ. The optimal parameters are the combination (σ _{1}, σ _{2}, γ) with the best LOOCV performance. Note the complexity of this optimization procedure: both the kernel parameters (σ _{1} and σ _{2}) and γ need to be optimized in the sense of the LOOCV performance.
Results
In all case studies except the fourth, two thirds of the data samples of each class are assigned randomly to the training set and the rest to the test set. These randomizations are the same for all numerical experiments on all data sets. The split was stratified to ensure that the relative proportion of outcomes in both the training and the test set was similar to the original proportion in the full data set. In all these cases, the microarray data were standardized to zero mean and unit variance. Normalization of the training sets as well as the test sets is done by using the mean and standard deviation of each gene expression profile of the training sets. In the fourth data set [3], all data samples have already been assigned to a training set or test set.
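The split and normalization described above can be sketched as follows. This is an illustration only: the function names, the random seed and the random expression matrix are assumptions, while the class sizes (85 and 25) are those of the first case study.

```python
import numpy as np

def stratified_split(y, train_fraction, rng):
    """Assign train_fraction of each class to training, the rest to test."""
    train_idx = []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_train = int(round(train_fraction * len(idx)))
        train_idx.extend(idx[:n_train])
    train_idx = np.sort(np.array(train_idx))
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    return train_idx, test_idx

def standardize(train, test):
    """Scale both sets with the per-gene mean and standard deviation of the training set."""
    mean = train.mean(axis=0)
    std = train.std(axis=0, ddof=1)
    std[std == 0] = 1.0                      # guard against constant genes
    return (train - mean) / std, (test - mean) / std

rng = np.random.default_rng(42)
y = np.concatenate((-np.ones(85), np.ones(25)))       # 85 non-recurrent, 25 recurrent
X = rng.standard_normal((110, 50)) * 3.0 + 5.0        # toy expression matrix

train_idx, test_idx = stratified_split(y, 2.0 / 3.0, rng)
X_train_std, X_test_std = standardize(X[train_idx], X[test_idx])
print(len(train_idx), len(test_idx))
```

Using only training-set statistics for the test set avoids leaking information from the test samples into the model.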
Initially, LSSVM classifiers were applied to the individual data sets: clinical and microarray. Then we performed GEVD on the training samples of the clinical and microarray data sets and obtained generalized eigenvectors (GEV). Scores are obtained by projecting the clinical data onto the directions of the GEV. An LSSVM model is trained and validated on the scores corresponding to the training set and test set respectively.
Kernel GEVD
The optimal parameters of kernel GEVD (the bandwidths of the clinical and microarray kernels) are selected using LOOCV performance. We applied kernel GEVD to the microarray and clinical kernels. Then we obtained the scores by projecting the clinical kernels onto the directions of the kernel GEV. As for GEVD, an LSSVM model is trained and validated on the scores corresponding to the training set and test set respectively. High-throughput data such as microarray data were used only for model development. The results show that considering the two data sets in a single framework improves the prediction performance over that of the individual data sets. In addition, kernel GEVD significantly improves the classification performance over GEVD. The results of the five case studies are shown in Table 2 and Figure 2. We represent expression and clinical data with kernel matrices based on the RBF kernel function, which transforms these data, despite their diverse structures, into kernel matrices of the same size.
Weighted LSSVM classifier
We proposed a weighted LSSVM classifier, a useful technique in data fusion as well as in supervised learning. The parameters associated with this method (γ in Equation (6) and σ _{1}, σ _{2}, the bandwidths of the clinical and microarray kernel functions) are selected by the Algorithm. In each LOOCV iteration, one sample is left out and models are built for all possible combinations of parameters on the remaining N−1 samples. The optimization problem is not sensitive to small changes in the bandwidths of the microarray and clinical kernel functions. Careful tuning of γ allows tackling the problem of overfitting and tolerating misclassifications. Model parameters are chosen corresponding to the model with the highest LOO AUC. The LOOCV approach takes less than a minute for a single iteration of the first three case studies and 12 minutes for the rest of the case studies. Statistical significance tests are performed in order to allow a correct interpretation of the results. A nonparametric paired test, the Wilcoxon signed rank test (signrank in MATLAB) [26], has been used in order to make general conclusions. A threshold of 0.05 is used, which means that two results are significantly different if the p-value of the Wilcoxon signed rank test applied to both of them is lower than 0.05. On all case studies, the weighted LSSVM classifier outperformed all other discussed methods in terms of test AUC, as shown in Table 2 and Figure 2. On the second and fourth cases, the weighted LSSVM performance is slightly, but not significantly, better than that of kernel GEVD.
To compare LSSVM with other classification methods, we applied Naive Bayes classifiers individually to the clinical and microarray data. In this case, the normal distribution was used to model continuous variables, while ordinal and nominal variables were modeled with a multivariate multinomial distribution. The average test AUC of this method on the five case studies is shown in Table 3.
Then we compare the proposed weighted LSSVM classifier with other data fusion techniques which integrate microarray and clinical data sets. Daemen et al. [7] investigated the effect of data integration on performance with three case studies [13]-[15]. They reported that a better performance was obtained when considering both clinical and microarray data, with the weights (μ) assigned to them optimized (μ · Clinical + (1 − μ) · Microarray). In addition, they concluded from their 10-fold AUC measurements that the clinical kernel variant led to a significant increase in performance in the kernel-based integration approach of clinical and microarray data. We have taken the first three case studies from the work of Daemen et al. [7]. For the kernel matrix obtained from microarray data, they considered the 200 most differential genes selected from the training data with the Wilcoxon rank sum test. We have taken the fourth case study from the paper of Gevaert et al. [2], in which they investigated different types of integration strategies with a Bayesian network classifier. They concluded that partial integration performs better in terms of test AUC. Our results also confirm that considering microarray and clinical data sets together improves prediction performance over that of the individual data sets.
In our analysis, the microarray-based kernel matrix is calculated on the large data set without preselecting genes, thus avoiding potential selection bias [27]. In addition, we compared the RBF kernel with the clinical kernel function [7] on the weighted LSSVM classifier, in terms of LOOCV performance. Results are given in Table 4. We followed the same strategy as explained for the weighted LSSVM classifier, except that the clinical kernel function was used for the clinical parameters. On three out of five case studies, the RBF kernel function performs better than the clinical kernel function.
Discussion
Integrative analysis has been primarily used to prioritize disease genes or chromosomal regions for experimental testing, to discover disease subtypes or to predict patient survival or other clinical variables. The ultimate goal of this work is to propose a machine learning approach which is functional in both data fusion and supervised learning. We further analyzed the potential benefits of merging microarray and clinical data sets for prognostic application in breast cancer diagnosis.
We integrate microarray and clinical data into one mathematical model, for the development of highly homogeneous classifiers in clinical decision support. For this purpose, we present a kernel-based integration framework in which each data set is transformed into a kernel matrix. Integration occurs at this kernel level without referring back to the data. Some studies [1],[7] already reported that intermediate integration of clinical and microarray data sets improves prediction performance on breast cancer outcome. In primal space, the clinical classifier is weighted with expression values. The solution in dual space is given in Equations (6) and (7), which provide a way to integrate two kernel functions explicitly and perform further classification.
To verify the merit of the proposed approach over single data sources such as clinical and microarray data, LSSVM classifiers were built on all data sets individually for classifying cancer patients. Next, GEVD and kernel GEVD were performed. Then the projected variables in the new space (scores) were used to build the LSSVM. Finally, the weighted LSSVM approach was used to integrate both microarray and clinical kernel functions and to perform the subsequent classification. Thus, the weighted LSSVM classifier provides a new optimization framework for solving classification problems using features of different types, such as clinical and microarray data.
We should note that the models proposed in this paper are computationally expensive, but less so than other kernel-based data fusion techniques. Since the proposed weighted LSSVM classifier handles both data fusion and classification in a single framework, it does not incur an additional cost for tuning the parameters of a separate kernel-based classifier. Note also that the weighting matrix should be invertible in the optimization problems of kernel GEVD and the weighted LSSVM classifier.
In life science research, there is an increasing need for integration of heterogeneous data such as proteomics, genomics and mass spectral imaging data. Further studies are required to determine which data sets are most suitable to be used as the weighting matrix. The proposed weighted LSSVM classifier integrates heterogeneous data sets to achieve well-performing and affordable classifiers.
Conclusion
The results suggest that the use of our integration approach on gene expression and clinical data can improve the performance of decision making in cancer. We proposed a weighted LSSVM classifier for the integration of two data sources and the subsequent prediction task. Each data set is represented with a kernel matrix based on the RBF kernel function. The proposed clinical classifier is a step towards improving predictions for individual patients about prognosis, metastatic phenotype and therapy responses.
Because the parameters (the bandwidths of the kernel matrices and the regularization term γ of the weighted LSSVM) had to be optimized, all possible combinations of these parameters were investigated with LOOCV. Since this parameter optimization strategy is time consuming, one could further investigate a parameter optimization criterion for kernel GEVD and the weighted LSSVM.
The applications of the proposed method are not limited to clinical and expression data sets. Possible additional applications of the weighted LSSVM include the integration of genomic information collected from different sources and biological processes. In short, the proposed machine learning approach is a promising mathematical framework for both data fusion and nonlinear classification problems.
References
van Vliet MH, Horlings HM, van de Vijver M, Reinders MJT: Integration of clinical and gene expression data has a synergetic effect on predicting breast cancer outcome. PLoS ONE. 2012, 7: e40385-e40358. 10.1371/journal.pone.0040358.
Gevaert O, Smet F, Timmerman D, Moreau Y, De Moor B: Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics. 2006, 22: e184-e190. 10.1093/bioinformatics/btl230.
van’t Veer LJ, Dai H, Van De Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, Van Der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernard R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002, 415: 530-536. 10.1038/415530a.
Boulesteix AL, Porzelius C, Daumer M: Microarray-based classification and clinical predictors: on combined classifiers and additional predictive value. Bioinformatics. 2008, 24: 1698-1706. 10.1093/bioinformatics/btn262.
Daemen A, Gevaert O, Ojeda F, Debucquoy A, Suykens JAK, Sempoux C, Machiels JP, Haustermans K, De Moor B: A kernel-based integration of genome-wide data for clinical decision support. Genome Med. 2009, 1 (4): 39. 10.1186/gm39.
Yu S, Tranchevent LC, De Moor B, Moreau Y: Kernel-based Data Fusion for Machine Learning: Methods and Applications in Bioinformatics and Text Mining. 2011, Springer
Daemen A, Timmerman D, Van den Bosch T, Bottomley C, Kirk E, Van Holsbeke C, Valentin L, Bourne T, De Moor B: Improved modeling of clinical data with kernel methods. Artif Intell Med. 2012, 54: 103-114. 10.1016/j.artmed.2011.11.001.
Pittman J, Huang E, Dressman H, Horng C, Cheng S, Tsou M, Chen C, Bild A, Iversen E, Huang A, Nevins J, West M: Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. PNAS. 2004, 101: 8431-8436. 10.1073/pnas.0401736101.
Sedeh SR, Bathe M, Bathe KJ: The Subspace Iteration Method in Protein Normal Mode Analysis. J Comput Chem. 2010, 31: 66-74. 10.1002/jcc.21250.
Alter O, Brown PO, Botstein D: Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. PNAS. 2003, 100: 3351-3356. 10.1073/pnas.0530258100.
Chu F, Wang L: Application of Support Vector Machine to Cancer Classification with Microarray Data. Int J Neural Syst World Sci. 2005, 5: 475-484. 10.1142/S0129065705000396.
Chun LH, Wen CL: Detecting differentially expressed genes in heterogeneous disease using half Student’s t-test. Int J Epidemiol. 2010, 10: 18.
Chin K, De Vries S, Fridlyand J, Spellman PT, Roydasgupta R, Kuo WL, Lapuk A, Neve RM, Qian Z, Ryder T, Chen F, Feiler H, Tokuyasu T, Kingsley C, Dairkee S, Meng Z, Chew K, Pinkel D, Jain A, Ljung BM, Esserman L, Albertson DG, Waldman FM, Gray JW: Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell. 2006, 10: 529-541. 10.1016/j.ccr.2006.10.009.
Hess KR, Anderson K, Symmans WF, Valero V, Ibrahim N, Mejia JA, Booser D, Theriault RL, Buzdar AU, Dempsey PJ, Rouzier R, Sneige N, Ross JS, Vidaurre T, Gómez HL, Hortobagyi GN, Pusztai L: Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer. J Clin Oncol. 2006, 24: 4236-4244. 10.1200/JCO.2006.05.6861.
Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, Haibe-Kains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D, Buyse M, van de Vijver MJ, Bergh J, Piccart M, Delorenzi M: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst. 2006, 98: 262-272. 10.1093/jnci/djj052.
Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, Bergh J: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. PNAS. 2005, 102 (38): 13550-13555. 10.1073/pnas.0506230102.
Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, Watson SJ, Meng F: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 2005, 33: e175. 10.1093/nar/gni179.
Golub GH, Van Loan CF: Matrix Computations. 1989, Johns Hopkins University Press, Baltimore
Higham N: Newton’s method for the Matrix square root. Math Comput. 1986, 46 (174): 537-549.
Suykens JAK, Van Gestel T, Vandewalle J, De Moor B: A support vector machine formulation to PCA analysis and its kernel version. IEEE Trans Neural Netw. 2003, 14 (2): 447-450. 10.1109/TNN.2003.809414.
Vapnik V: The Nature of Statistical Learning Theory. 1995, Springer-Verlag
Suykens JAK, Vandewalle J: Least squares support vector machine classifiers. Neural Process Lett. 1999, 9: 293-300. 10.1023/A:1018628609742.
Suykens JAK, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J: Least Squares Support Vector Machines. 2002, World Scientific, Singapore
Alzate C, Suykens JAK: Multiway spectral clustering with out-of-sample extensions through weighted kernel PCA. IEEE Trans Pattern Anal Mach Intell. 2010, 32 (2): 335-347. 10.1109/TPAMI.2008.292.
Suykens JAK, De Brabanter J, Lukas L, Vandewalle J: Weighted least squares support vector machines: robustness and sparse approximation. Neurocomputing. 2002, 48: 85-105. 10.1016/S0925-2312(01)00644-0.
Dawson-Saunders B, Trapp RG: Basic & Clinical Biostatistics. 1994, Prentice-Hall International Inc., London
Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS. 2002, 99: 6562-6566. 10.1073/pnas.102102699.
Acknowledgements
BDM is a full professor at the Katholieke Universiteit Leuven, Belgium. Johan AK Suykens is a professor at the Katholieke Universiteit Leuven, Belgium. Research supported by: Research Council KU Leuven: GOA/10/09 MaNet, KUL PFV/10/016 SymBioSys, START 1, OT 09/052 Biomarker, several PhD/postdoc & fellow grants; Industrial Research fund (IOF): IOF/HB/13/027 Logic Insulin, IOF: HB/12/022 Endometriosis; Flemish Government: FWO: PhD/postdoc grants, projects: G.0871.12N (Neural circuits), research community MLDM; IWT: PhD Grants; TBM-Logic Insulin, TBM Haplotyping, TBM Rectal Cancer, TBM IETA; Hercules Stichting: Hercules III PacBio RS; iMinds: SBO 2013; Art&D Instance; IMEC: PhD grant; VLK van der Schueren: rectal cancer; VSC Tier 1: exome sequencing; Federal Government: FOD: Cancer Plan 2012-2015 KPC29023 (prostate); COST: Action BM1104: Mass Spectrometry Imaging, Action BM1006: NGS Data analysis network; CoE EF/05/006, IUAP DYSCO, FWO G.0377.12, ERC AdG A-DATADRIVE-B. The scientific responsibility is assumed by its authors.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
MT performed the kernel-based data integration modeling and drafted the paper. KDB and JS participated in the design and implementation of the framework. KDB, JS and BDM helped draft the manuscript. All authors read and approved the final manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Thomas, M., Brabanter, K.D., Suykens, J.A. et al. Predicting breast cancer using an expression values weighted clinical classifier. BMC Bioinformatics 15, 411 (2014). https://doi.org/10.1186/s12859-014-0411-1
DOI: https://doi.org/10.1186/s12859-014-0411-1