Network-based support vector machine for classification of microarray samples
BMC Bioinformatics volume 10, Article number: S21 (2009)
Abstract
Background
The importance of the network-based approach to identifying biological markers for diagnostic classification and prognostic assessment in the context of microarray data has been increasingly recognized. To our knowledge, there have been few, if any, statistical tools that explicitly incorporate the prior information of gene networks into classifier building. The main idea of this paper is to take full advantage of the biological observation that neighboring genes in a network tend to function together in biological processes and to embed this information into a formal statistical framework.
Results
We propose a network-based support vector machine for binary classification problems by constructing a penalty term from the F_{∞}-norm applied to pairwise gene neighbors, with the aim of improving predictive performance and gene selection. Simulation studies in both low- and high-dimensional data settings as well as two real microarray applications indicate that the proposed method is able to identify more clinically relevant genes while maintaining a sparse model with either similar or higher prediction accuracy compared with the standard and the L_{1}-penalized support vector machines.
Conclusion
The proposed network-based support vector machine has the potential to be a practically useful classification tool for microarrays and other high-dimensional data.
Background
The past two decades have witnessed rapid advances in gene expression profiling with microarray technology, which not only brighten the prospect of deciphering the complexity of disease genesis and progression at the genomic level, but also revolutionize diagnostic, therapeutic, and prognostic approaches. Until recently, diagnostic classification and prognostic assessment have been based on conventional clinical and pathological risk factors, such as patient age and tumor size, many of which are believed to be secondary manifestations [1]. The advent of microarray technology allows researchers to explore primary disease mechanisms by comparing gene expression profiles for malignant and normal cells. The regularity and aberration in the expression patterns of certain genes shed light on their functions and pathological importance [2]. Studies that seek to identify gene markers to refine diagnostic classification and improve prognostic prediction in the context of gene expression data have enriched the literature [3–5]. In recent years, researchers have realized that gene markers identified from microarrays drawn from different studies on the same disease across similar cohorts lack consistency [6, 7]. A possibly more effective means to resolve this problem is to employ a network-based approach, that is, to identify markers as gene subnetworks, defined as groups of functionally related genes based on a gene network, instead of treating individual genes as completely independent and identical a priori as in most existing approaches [1].
A novel network-based approach proposed recently [1, 8] can be summarized as follows: (1) randomly searching subnetworks and assigning a score to each subnetwork that characterizes the subnetwork-wise gene expression level; (2) identifying significant subnetworks that can well discriminate the clinical outcome; (3) constructing a classifier based on the significant subnetworks with a conventional statistical tool, such as logistic regression. Essentially, such a network-based approach aggregates gene expression data at the subnetwork level and then identifies and utilizes some significant subnetworks. It has been shown that such a network-based approach not only improves predictive performance and reproducibility, but also provides biological insights into the molecular mechanisms underlying the clinical outcome. However, the above method is largely heuristic without a formal statistical framework; more importantly, it involves a random search over subnetworks, leading to possibly different results from different runs with no guarantee of the optimality of the final result. Because of the ever-increasing popularity of penalization methods for high-dimensional data, we propose a novel network-based penalty to be used with the hinge loss, leading to a network-based support vector machine. While maintaining some desirable properties of the support vector machine (SVM) with the hinge loss function, the network-based penalty directly integrates a biological network to realize more effective variable selection, as compared with generic methods, such as the standard SVM (STDSVM) or the L_{1}-penalized SVM (L1SVM).
The support vector machine (SVM) is one of the most popular supervised learning techniques with wide-ranging applications [9, 10]. In particular, previous studies have demonstrated its superior performance in gene expression data analysis, especially its ability to handle high-dimensional data [11, 12]. Nevertheless, with categorical predictors, both the STDSVM and the L1SVM may have some shortcomings. Zou and Yuan [13] applied the concept of grouped variable selection and developed an F_{∞}-norm penalized SVM to realize simultaneous selection/elimination of all the features derived from the same categorical factor (or a group of variables). Their numerical examples showed that the F_{∞}-norm SVM outperformed the L1SVM in factor-wise variable selection. We extend the idea of variable grouping to gene networks: rather than grouping all the dummy variables created from the same categorical factor, we treat two neighboring genes in a network as one group. The network-based penalty is constructed as the sum of the F_{∞}-norms applied to the groups of neighboring-gene pairs. With the hinge loss penalized by such a network-based penalty as our objective function, we obtain our network-based SVM. The later sections are organized as follows. We begin with a brief review of the SVM, and then introduce our proposed network-based SVM. We evaluate its performance by simulation studies in both low-dimensional and high-dimensional data settings as well as two real data applications. The last section concludes the paper with a brief summary.
Methods
Existing methods
Suppose we have training data $\{(x_i, y_i)\}_{i=1}^{N}$ with x_{ i } ∈ ℝ^{p} and y_{ i } ∈ {−1, 1}. Define a hyperplane {x : f(x) = x^{T}β + β_{0} = 0}. The classification rule induced by f(x) is sign[$\widehat{f}$(x)]. The SVM searches for the hyperplane $\widehat{f}(x)=x^{T}\widehat{\beta}+\widehat{\beta}_{0}$ that maximizes the margin between the training data points of class 1 and class −1:
$$\max_{\beta_0,\beta}\ \frac{1}{\|\beta\|_2}\quad \text{subject to}\quad y_i(\beta_0+x_i^{T}\beta)\ge 1-\xi_i,\quad \xi_i\ge 0,\quad i=1,\ldots,N,\quad \sum_{i=1}^{N}\xi_i\le C, \tag{1}$$
where ξ_{ i } are slack variables, and C is a tuning parameter to be determined. The STDSVM has an equivalent hinge loss + penalty formulation as an optimization problem [13–15]:
$$\min_{\beta_0,\beta}\ \sum_{i=1}^{N}\left[1-y_i(\beta_0+x_i^{T}\beta)\right]_{+}+\lambda\|\beta\|_2^2, \tag{2}$$
where the subscript "+" denotes the positive part, i.e., z_{+} = max{z, 0}, $\|\beta\|_2^2=\sum_{k=1}^{p}|\beta_k|^2$, and λ is the tuning parameter. The solution to (1) is the same as that to (2).
The STDSVM yields nonzero estimates for all coefficients and therefore cannot perform variable selection. The L1SVM was proposed to accomplish variable selection. It can be formulated as
$$\min_{\beta_0,\beta}\ \sum_{i=1}^{N}\left[1-y_i(\beta_0+x_i^{T}\beta)\right]_{+}+\lambda\|\beta\|_1, \tag{3}$$
where $\|\beta\|_1=\sum_{k=1}^{p}|\beta_k|$. The L1SVM wins over the STDSVM when the true model is sparse, while the STDSVM is preferred if there are not many redundant noise features [16].
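The two objectives reviewed above differ only in the penalty attached to the hinge loss. The following is a minimal sketch, not the authors' implementation, that evaluates both objectives for a given coefficient vector; the function names and the toy data are illustrative assumptions.

```python
def hinge_loss(X, y, beta, beta0):
    """Sum over observations of [1 - y_i * (beta0 + x_i^T beta)]_+ ."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        f_i = beta0 + sum(b * v for b, v in zip(beta, x_i))
        total += max(0.0, 1.0 - y_i * f_i)  # positive part
    return total


def std_svm_objective(X, y, beta, beta0, lam):
    """Hinge loss plus lambda times the squared L2 norm of beta."""
    return hinge_loss(X, y, beta, beta0) + lam * sum(b * b for b in beta)


def l1_svm_objective(X, y, beta, beta0, lam):
    """Hinge loss plus lambda times the L1 norm of beta."""
    return hinge_loss(X, y, beta, beta0) + lam * sum(abs(b) for b in beta)


# Toy data (hypothetical): three observations, two predictors, labels in {-1, 1}.
X = [[1.0, 2.0], [-1.0, 0.5], [0.2, -0.3]]
y = [1, -1, 1]
beta, beta0 = [0.5, 0.0], 0.0

print(std_svm_objective(X, y, beta, beta0, lam=1.0))
print(l1_svm_objective(X, y, beta, beta0, lam=1.0))
```

Because the L1 penalty is not differentiable at zero, minimizing the second objective can set some coefficients exactly to zero, which is the variable selection property discussed above.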
Zou and Yuan [13] pointed out the shortcoming of the L_{1}-norm penalty: even though it encourages parsimonious models, it fails to guarantee successful models in cases of categorical predictors because each dummy variable is selected independently. They applied the concept of grouped variable selection and proposed an F_{∞}-norm SVM to realize simultaneous selection/elimination of features derived from the same factor so as to accomplish automatic factor-wise variable selection. Suppose we have G factors F_{1},...,F_{ G }. From each factor F_{ g }, we generate a feature vector $x_{(g)}=(x_{1}^{(g)},\cdots ,x_{j}^{(g)},\cdots ,x_{n_g}^{(g)})^{T}$. Correspondingly, we have the coefficient vector $\beta_{(g)}=(\beta_{1}^{(g)},\cdots ,\beta_{j}^{(g)},\cdots ,\beta_{n_g}^{(g)})^{T}$. Therefore,
$$f(x)=\beta_0+\sum_{g=1}^{G}x_{(g)}^{T}\beta_{(g)}. \tag{4}$$
Define the F_{∞}-norm of F_{ g } as
$$\|F_g\|_{\infty}=\|\beta_{(g)}\|_{\infty}=\max_{j}\left\{|\beta_{j}^{(g)}|\right\}. \tag{5}$$
The F_{∞}-norm SVM is formulated as
$$\min_{\beta_0,\beta}\ \sum_{i=1}^{N}\left[1-y_i\Big(\beta_0+\sum_{g=1}^{G}x_{i,(g)}^{T}\beta_{(g)}\Big)\right]_{+}+\lambda\sum_{g=1}^{G}\|\beta_{(g)}\|_{\infty}. \tag{6}$$
The most noteworthy property of the F_{∞}-norm SVM is its guarantee of sparsity at the factor level. Due to the singularity of the infinity norm, i.e., ||β_{(g)}||_{∞} is not differentiable at β_{(g)} = 0, β_{(g)} will be exactly zero if the regularization parameter λ is properly chosen [13]. Therefore, the F_{∞}-norm SVM automatically eliminates factors that are completely irrelevant to the response, and thus achieves the goal of factor-wise selection. The empirical evidence shows that the F_{∞}-norm SVM often outperforms both the L1SVM and the STDSVM.
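To make the factor-level penalty concrete, the sketch below reads the F_{∞}-norm of a factor as the largest absolute coefficient within its group (the usual infinity norm) and sums it over factors. The function names and the grouped coefficients are hypothetical and serve only as an illustration.

```python
def f_inf_penalty(groups):
    """Sum over factors of the largest absolute coefficient in each group."""
    return sum(max(abs(b) for b in group) for group in groups)


def selected_factors(groups):
    """Indices of factors with at least one nonzero coefficient."""
    return [g for g, group in enumerate(groups) if any(b != 0.0 for b in group)]


# Hypothetical coefficients for three factors with 3, 2, and 1 dummy variables.
groups = [[0.0, 0.0, 0.0], [0.4, -1.2], [0.3]]

print(f_inf_penalty(groups))     # penalty contributed by the three factors
print(selected_factors(groups))  # the first factor is eliminated as a whole
```

A factor contributes nothing to the penalty only when all of its coefficients are zero, which is why the penalty removes or keeps a factor as a unit.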
New method
Biological observations reveal that neighboring genes in a network tend to function together in biological processes. To incorporate this prior information, we propose a network-based SVM for binary classification that facilitates building models that extract more biological insight from gene expression data. The penalty term that characterizes the network structure is specified by applying the F_{∞}-norm to the known functional interrelationships among genes, treating each pair of functionally related genes as one group.
Consider a gene network with S denoting the set of all edges, i.e., the pairs of connected genes:
S = {(j_{1}, j_{2}) : gene j_{1} and gene j_{2} are connected}.
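Define w_{ k } as a weight for gene k. For example, w_{ k } = $\sqrt{d_k}$, where d_{ k } is the number of direct neighbors of gene k, or w_{ k } = d_{ k }, or simply w_{ k } = 1 for all genes. We propose a novel penalty in the form of
$$\lambda\sum_{(j_1,j_2)\in S}\max\left\{\frac{|\beta_{j_1}|}{w_{j_1}},\frac{|\beta_{j_2}|}{w_{j_2}}\right\}. \tag{7}$$
Thus the network-based SVM solves the following optimization problem:
$$\min_{\beta_0,\beta}\ \sum_{i=1}^{N}\left[1-y_i(\beta_0+x_i^{T}\beta)\right]_{+}+\lambda\sum_{(j_1,j_2)\in S}\max\left\{\frac{|\beta_{j_1}|}{w_{j_1}},\frac{|\beta_{j_2}|}{w_{j_2}}\right\}. \tag{8}$$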
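As an illustration of how the network enters the penalty, the following sketch computes node degrees from an edge list, derives the three weight choices, and evaluates the penalty as the sum over edges of the larger weighted absolute coefficient of the two connected genes. It assumes genes are indexed 0, ..., p−1 and that each edge is listed once; the function names and the small star network are hypothetical.

```python
import math


def degrees(edges, p):
    """Number of direct neighbors d_k of each gene k = 0, ..., p-1."""
    d = [0] * p
    for j1, j2 in edges:
        d[j1] += 1
        d[j2] += 1
    return d


def gene_weights(edges, p, kind="sqrt"):
    """Weights w_k = 1 ('one'), sqrt(d_k) ('sqrt'), or d_k ('degree')."""
    d = degrees(edges, p)
    if kind == "one":
        return [1.0] * p
    if kind == "sqrt":
        return [math.sqrt(dk) for dk in d]
    if kind == "degree":
        return [float(dk) for dk in d]
    raise ValueError("kind must be 'one', 'sqrt', or 'degree'")


def network_penalty(beta, edges, weights):
    """Sum over edges of max(|beta_j1| / w_j1, |beta_j2| / w_j2)."""
    return sum(
        max(abs(beta[j1]) / weights[j1], abs(beta[j2]) / weights[j2])
        for j1, j2 in edges
    )


# Hypothetical star network: gene 0 regulates genes 1, 2, and 3.
edges = [(0, 1), (0, 2), (0, 3)]
beta = [3.0, 1.0, -2.0, 0.0]

for kind in ("one", "sqrt", "degree"):
    w = gene_weights(edges, len(beta), kind)
    print(kind, network_penalty(beta, edges, w))
```

In this toy example the penalty decreases as the weight of the regulator grows, which reflects the weaker shrinkage applied to highly connected genes under heavier weights.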
Four properties of the penalty term are noteworthy. First, the regularization is performed at the level of grouped genes, with each group containing two neighboring genes in the network. In the case of penalized linear regression, it has been proven that this penalty can eliminate both $\beta_{j_1}$ and $\beta_{j_2}$ simultaneously if (j_{1}, j_{2}) ∈ S [17]. The automatic selection of grouped features is due to the singularity of the function max{a, b} [13]. This formulation reflects our assumption that neighboring genes tend to contribute (or not to contribute) to the same biological process at the same time. Second, the choice of the weight depends on the goal of shrinkage and influences the predictive performance. Consider a network comprised of several subnetworks, each with one regulator and ten target genes. Because of the singularity of the function max{a, b} at a = b, the weighted penalty, in the context of penalized regression, encourages $|\beta_{j_1}|/w_{j_1}=|\beta_{j_2}|/w_{j_2}$ [17]. Here we examine three weight functions in particular: w_{ k } = 1, w_{ k } = $\sqrt{d_k}$, and w_{ k } = d_{ k }, where gene k has d_{ k } direct neighbors. The new method encourages $|\beta_{j_1}|=|\beta_{j_2}|$ if w_{ k } = 1, $\frac{|\beta_{j_1}|}{\sqrt{d_{j_1}}}=\frac{|\beta_{j_2}|}{\sqrt{d_{j_2}}}$ if w_{ k } = $\sqrt{d_k}$, and $\frac{|\beta_{j_1}|}{d_{j_1}}=\frac{|\beta_{j_2}|}{d_{j_2}}$ if w_{ k } = d_{ k }. Therefore, heavier weights (from w_{ k } = 1 through w_{ k } = $\sqrt{d_k}$ to w_{ k } = d_{ k }) favor larger coefficient estimates for genes with more direct neighbors; in other words, heavier weights relax the shrinkage effect for the regulators, which are known to be biologically more important.
Due to this property, choosing a heavy weight is a simple strategy that enables us to alleviate the bias in the coefficient estimates caused by penalization and possibly improve the predictive performance. Our default weight is w_{ k } = $\sqrt{d_k}$. The weight, considered as another tuning parameter, can be determined by cross-validation or from an independent validation data set, though we do not consider it here. Third, the penalty term, under certain conditions, tends to encourage a grouping effect, where highly correlated predictors tend to have similar coefficient estimates [17–20]. Fourth, the penalty is linear, which allows the solution to be found by linear programming (LP), which is computationally convenient.
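As a brief worked illustration of the second property, consider one edge joining a regulator r with ten direct neighbors and one of its target genes l, and assume (for illustration only) that the target has no other neighbor, so that d_r = 10 and d_l = 1. The equality of weighted absolute coefficients encouraged by the penalty then implies the following ratios between the two coefficients:

```latex
\frac{|\beta_r|}{w_r} = \frac{|\beta_l|}{w_l}
\quad\Longrightarrow\quad
|\beta_r| =
\begin{cases}
|\beta_l|, & w_k = 1,\\[2pt]
\sqrt{10}\,|\beta_l| \approx 3.16\,|\beta_l|, & w_k = \sqrt{d_k},\\[2pt]
10\,|\beta_l|, & w_k = d_k .
\end{cases}
```

Under this assumption the regulator is allowed a coefficient about 3.16 times, or 10 times, that of its target under the two heavier weights, which is the sense in which heavier weights relax the shrinkage on regulators.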
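As usual, the fitted classifier is $\widehat{f}(x)=\widehat{\beta}_{0}+x^{T}\widehat{\beta}$, and the classification rule is sign($\widehat{f}$(x)). We employ LP to obtain the solution to (8) by
$$\min\ \sum_{i=1}^{N}\xi_i+\lambda\sum_{(j_1,j_2)\in S}\eta_{j_1j_2}$$
subject to
$$y_i\Big(\beta_0^{+}-\beta_0^{-}+\sum_{j=1}^{p}x_{ij}(\beta_j^{+}-\beta_j^{-})\Big)\ge 1-\xi_i,\quad \xi_i\ge 0,\quad i=1,\ldots,N,$$
$$\frac{\beta_{j_1}^{+}+\beta_{j_1}^{-}}{w_{j_1}}\le\eta_{j_1j_2},\quad \frac{\beta_{j_2}^{+}+\beta_{j_2}^{-}}{w_{j_2}}\le\eta_{j_1j_2},\quad (j_1,j_2)\in S,$$
where
$$\beta_0^{+},\ \beta_0^{-},\ \beta_j^{+},\ \beta_j^{-}\ge 0,\quad j=1,\ldots,p,$$
and $\beta_j=\beta_j^{+}-\beta_j^{-}$, in which $\beta_j^{+}$ and $\beta_j^{-}$ denote the positive and negative parts of β_{ j }, and η_{j_1 j_2} is an auxiliary variable that bounds the larger of the two weighted absolute coefficients on each edge. The new method can be readily implemented with the R package lpSolve, as can the L1SVM. The R package e1071 (with a linear kernel) is used to obtain the solution to the STDSVM.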
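The LP described above can be assembled mechanically. The sketch below is an illustration in Python rather than the R code used for the paper: it splits each coefficient into nonnegative positive and negative parts, adds one slack variable per observation and one auxiliary variable per edge (the auxiliary variables and all names here are assumptions of this sketch), and returns the cost vector and inequality system in the form A z ≤ b with z ≥ 0. The optional solver wrapper assumes SciPy is installed and uses `scipy.optimize.linprog`.

```python
def build_network_svm_lp(X, y, edges, weights, lam):
    """Return (c, A, b) for: minimize c^T z subject to A z <= b, z >= 0.

    Variable order in z: beta0+, beta0-, beta+ (p), beta- (p),
    slacks xi (n), edge auxiliaries eta (one per edge).
    """
    n, p, m = len(X), len(X[0]), len(edges)
    pos, neg, xi, eta = 2, 2 + p, 2 + 2 * p, 2 + 2 * p + n
    n_var = eta + m

    # Objective: sum of slacks plus lambda times sum of edge auxiliaries.
    c = [0.0] * n_var
    for i in range(n):
        c[xi + i] = 1.0
    for e in range(m):
        c[eta + e] = lam

    A, b = [], []
    # Margin constraints: -y_i * f(x_i) - xi_i <= -1.
    for i in range(n):
        row = [0.0] * n_var
        row[0], row[1] = -float(y[i]), float(y[i])
        for j in range(p):
            row[pos + j] = -y[i] * X[i][j]
            row[neg + j] = y[i] * X[i][j]
        row[xi + i] = -1.0
        A.append(row)
        b.append(-1.0)
    # Edge constraints: (beta_j+ + beta_j-) / w_j - eta_e <= 0 for both ends.
    for e, (j1, j2) in enumerate(edges):
        for j in (j1, j2):
            row = [0.0] * n_var
            row[pos + j] = 1.0 / weights[j]
            row[neg + j] = 1.0 / weights[j]
            row[eta + e] = -1.0
            A.append(row)
            b.append(0.0)
    return c, A, b


def fit_network_svm(X, y, edges, weights, lam):
    """Solve the LP with SciPy (assumed available) and return (beta0, beta)."""
    from scipy.optimize import linprog  # imported lazily on purpose

    c, A, b = build_network_svm_lp(X, y, edges, weights, lam)
    res = linprog(c, A_ub=A, b_ub=b, bounds=(0, None))
    p = len(X[0])
    z = res.x
    beta0 = z[0] - z[1]
    beta = [z[2 + j] - z[2 + p + j] for j in range(p)]
    return beta0, beta
```

Genes that appear in no edge receive no penalty in this formulation, so in practice the network is assumed to cover every gene included in the model.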
Results and discussion
Simulation
We conducted several simulation studies to numerically evaluate the performance of the network-based SVM along with the STDSVM and L1SVM. The simulation setups were similar to those in [18]. We started from a simple network consisting of 5 subnetworks, each having a regulator gene t (t = 1,...,5) that regulated 10 target genes, leading to a total of 55 genes (p = 55). We assumed that two out of the five subnetworks were informative; that is, the coefficients of 22 genes were nonzero and thus informative to the outcome, while the remaining 33 noise genes had no effect on the outcome. We generated a simulated data set by the following steps:

Generate the expression level of regulator gene t, X_{ t }~ N (0, 1), t = 1,..., 5, independently.

Assume that the expression level of regulator gene t and each of its regulated genes follow a bivariate normal distribution with correlation 0.7. Thus, the expression level of each target gene regulated by gene t, ${X}_{l}^{(t)}$ ~ N(0.7X_{ t }, 0.51), l = 1,..., 10 and t = 1,..., 5.

Generate the outcome Y from a logistic regression model: logit(Pr(Y = 1 | X)) = X^{T}β + β_{0}, with β_{0} = 2, where X is the vector of the expression levels of all the genes, and the coefficient vector is $\beta =(\beta_{1}^{(1)},\ldots ,\beta_{10}^{(1)},\ldots ,\beta_{1}^{(5)},\ldots ,\beta_{10}^{(5)})$.
Four sets of true coefficients β were specified to reflect four scenarios:

1.
$$\beta =\Big(5,\underbrace{\frac{5}{\sqrt{10}},\cdots ,\frac{5}{\sqrt{10}}}_{10},-5,\underbrace{-\frac{5}{\sqrt{10}},\cdots ,-\frac{5}{\sqrt{10}}}_{10},0,\cdots ,0\Big).$$
The effect of one informative subnetwork was the same as the other in magnitude but in the opposite direction.

2.
$$\beta =\Big(5,\underbrace{\frac{5}{\sqrt{10}},\cdots ,\frac{5}{\sqrt{10}}}_{10},3,\underbrace{\frac{3}{\sqrt{10}},\cdots ,\frac{3}{\sqrt{10}}}_{10},0,\cdots ,0\Big).$$
Both informative subnetworks had positive effects but of different magnitudes.

3.
$$\beta =\Big(5,\underbrace{\frac{5}{\sqrt{10}},\cdots ,\frac{5}{\sqrt{10}}}_{7},-\frac{5}{\sqrt{10}},-\frac{5}{\sqrt{10}},-\frac{5}{\sqrt{10}},3,\underbrace{\frac{3}{\sqrt{10}},\cdots ,\frac{3}{\sqrt{10}}}_{7},-\frac{3}{\sqrt{10}},-\frac{3}{\sqrt{10}},-\frac{3}{\sqrt{10}},0,\cdots ,0\Big).$$
Target genes in the same informative subnetwork had both positive and negative effects.

4.
$$\beta =\Big(5,\underbrace{\frac{5}{\sqrt{10}},\cdots ,\frac{5}{\sqrt{10}}}_{6},\underbrace{-\frac{5}{\sqrt{10}},\cdots ,-\frac{5}{\sqrt{10}}}_{4},3,\underbrace{\frac{3}{\sqrt{10}},\cdots ,\frac{3}{\sqrt{10}}}_{6},\underbrace{-\frac{3}{\sqrt{10}},\cdots ,-\frac{3}{\sqrt{10}}}_{4},0,\cdots ,0\Big).$$
This scenario was similar to but more extreme than scenario 3.
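The data-generating steps above can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes the genes are ordered subnetwork by subnetwork with each regulator followed by its 10 targets, codes the second class as −1 to match the SVM labels, and uses the coefficients of scenario 2 (all positive) as the example; all names are hypothetical.

```python
import math
import random


def simulate(n, beta, beta0=2.0, n_sub=5, n_targets=10, rho=0.7, rng=None):
    """Simulate n observations of regulator/target expression and a binary outcome."""
    rng = rng or random.Random(0)
    target_sd = math.sqrt(1.0 - rho ** 2)  # conditional variance 1 - 0.49 = 0.51
    X, y = [], []
    for _ in range(n):
        row = []
        for _ in range(n_sub):
            reg = rng.gauss(0.0, 1.0)  # regulator ~ N(0, 1)
            row.append(reg)
            # each target | regulator ~ N(rho * regulator, 1 - rho^2)
            row.extend(rng.gauss(rho * reg, target_sd) for _ in range(n_targets))
        eta = beta0 + sum(b * v for b, v in zip(beta, row))
        # numerically stable logistic function
        if eta >= 0:
            prob = 1.0 / (1.0 + math.exp(-eta))
        else:
            prob = math.exp(eta) / (1.0 + math.exp(eta))
        y.append(1 if rng.random() < prob else -1)
        X.append(row)
    return X, y


# Scenario 2: two informative subnetworks with positive effects 5 and 3.
s = math.sqrt(10.0)
beta_scenario2 = [5.0] + [5.0 / s] * 10 + [3.0] + [3.0 / s] * 10 + [0.0] * 33

X_train, y_train = simulate(100, beta_scenario2, rng=random.Random(1))
print(len(X_train), len(X_train[0]))
```

Marginally each target gene then has variance 0.49 + 0.51 = 1 and correlation 0.7 with its regulator, as specified in the second step.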
Five methods, STDSVM, L1SVM, and the network-based SVM with w_{ k } = 1, w_{ k } = $\sqrt{d_k}$, and w_{ k } = d_{ k }, were compared based on the results averaged over 100 runs under each of the above four scenarios. For each run, 100 observations were simulated as training data to build a classifier (with any given λ), another 100 for tuning the regularization parameter λ, and a further 10,000 as test data. Each predictor was normalized to have mean 0 and standard deviation 1. Given any value of λ, we obtained the coefficient estimates from the training set, then applied the classifier to the tuning set to find the classification error. We searched a wide range of pre-specified values for the $\widehat{\lambda}$ that produced the smallest classification error. The classifier corresponding to $\widehat{\lambda}$ was identified as the fitted classifier $\widehat{f}$. Then we applied $\widehat{f}$ to the test set and calculated the test error, i.e., the number of misclassifications divided by the test sample size. Table 1 reports, for each method under each scenario, the mean classification error on the test set over 100 runs and its standard error (SE, in parentheses), computed as the standard deviation of the classification errors divided by the square root of the number of runs. To evaluate each method's ability to select informative genes, we examined the false negatives, defined as the number of informative genes whose coefficients were estimated to be zero. In addition, we also considered a smaller sample size: we repeated the entire process with 50 training data points, 50 tuning data points, and again 10,000 test data points. The network-based SVM is labeled "New" in the table.
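The evaluation protocol described above reduces to three small computations, sketched below: the classification error of a set of predictions, the choice of the λ with the smallest tuning error, and the mean and standard error of the test errors across runs. The use of the sample standard deviation and the tie-breaking rule (smallest λ among ties) are assumptions of this sketch, as are the function names and the toy numbers.

```python
import math


def classification_error(y_true, y_pred):
    """Number of misclassifications divided by the sample size."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def select_lambda(tuning_errors):
    """Return the lambda with the smallest tuning error.

    tuning_errors maps each candidate lambda to its error on the tuning set;
    ties are broken in favor of the smallest lambda (an assumption).
    """
    return min(tuning_errors, key=lambda lam: (tuning_errors[lam], lam))


def mean_and_se(errors):
    """Mean of the per-run errors and SD / sqrt(number of runs)."""
    n = len(errors)
    mean = sum(errors) / n
    sd = math.sqrt(sum((e - mean) ** 2 for e in errors) / (n - 1))
    return mean, sd / math.sqrt(n)


# Hypothetical values for illustration only.
print(classification_error([1, -1, 1, 1], [1, 1, 1, -1]))
print(select_lambda({0.01: 0.30, 0.1: 0.20, 1.0: 0.25}))
print(mean_and_se([0.1, 0.2, 0.3]))
```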
According to our simulation setups, the correct weight function should be w = $\sqrt{d}$. However, we found that the new method with w = d clearly outperformed all other methods in all the setups: it consistently made the most accurate classifications and missed no informative genes. The new method with w = $\sqrt{d}$ performed second best: in most cases, it improved the classification accuracy over the STDSVM and L1SVM, and under all the settings it produced models that identified more informative genes than the L1SVM. In contrast, w = 1 did not bring much gain over the STDSVM or the L1SVM. The L1SVM led to models that were too sparse, missing about 14 and 11 informative genes for n = 50 and n = 100, respectively. The superior performance and the larger model size of the heavy weight (w = d) compared with its counterparts (w = 1 and w = $\sqrt{d}$) are presumably due to its relaxation of the shrinkage effect. Penalization methods shrink $\widehat{\beta}$ toward zero by imposing constraints (the penalty term) and therefore introduce bias into $\widehat{\beta}$. By grouping neighboring genes, the new method encourages the pairwise weighted absolute coefficients to be equal. Therefore, a heavy weight leads to a larger $\widehat{\beta}$ for regulator genes. By choosing a heavier weight, we may overcome over-shrinkage, alleviate bias, and achieve better classification accuracy to some extent at the expense of model sparsity. As shown in Table 2, w = d produced larger $\widehat{\beta}$ for the regulators than its two counterparts. The L1SVM estimates were used as a yardstick to give an idea of the extent of shrinkage by each weight function. For example, w = 1 and w = $\sqrt{d}$ overly shrank all the regulators under all scenarios as compared with the L1SVM estimates.
Note that the binary outcome Y was generated from a logistic regression model while $\widehat{\beta}$ was estimated from a linear model, hence E($\widehat{\beta}$) may be different from β even for an unbiased estimator $\widehat{\beta}$ of the linear model.
Next, we evaluated the performance of the new method for high-dimensional data with large p. We used the setup of 50 observations for training, 50 for tuning, and 10,000 for test data. We assumed that (1) the network was composed of either 50 or 100 subnetworks, each having one gene regulating 10 target genes; (2) the first 2 subnetworks were informative, resulting in 22 informative genes; (3) the rest of the genes had no effect on the outcome, leading to 528 noise genes when p = 550 and 1,078 noise genes when p = 1,100; and (4) the true β was specified as in scenario 3. Table 3 shows the simulation results averaged over 100 runs. Again, we see the gains from using a heavy weight (w = d). It prevailed over all the other methods in making accurate classifications and selecting informative genes. The w = $\sqrt{d}$ method ranked second. However, w = d generated models much larger than those from the other methods except the STDSVM. In this case, the performance of w = 1 was no better than that of the L1SVM, possibly due to over-shrinkage of the effects of the regulator genes.
Applications to microarray data
To evaluate its performance on real data, we applied the new method to two microarray gene expression data sets related to Parkinson's disease (PD) [21] and breast cancer metastasis (BC) [1, 4], respectively.
Parkinson's disease
The data set includes the Parkinson's disease status and the expression levels of 22,283 genes from 105 patients (50 cases and 55 controls) [22]. We used the same network structure as [18]. The network combines 33 Kyoto Encyclopedia of Genes and Genomes (KEGG) regulatory pathways and contains a total of 1,523 genes and 6,865 edges. The data were randomly split into training (40 observations), tuning (20 observations), and test (45 observations) sets. The expression level of each gene was normalized to have mean 0 and standard deviation 1 across samples. The tuning parameter was identified from the tuning set, and the performance of the method was evaluated on the test set by the mean classification error and its standard error over 10 runs. Five methods were compared: STDSVM, L1SVM, and the network-based SVM with w = 1, w = $\sqrt{d}$, and w = d. To obtain a final model based on the new method with w = $\sqrt{d}$, we combined, for each run, the previous tuning and test data as the new tuning set, giving a sample size of 65 observations, on which the classification errors were calculated for wide-ranging values of the tuning parameter. After 10 runs, we had an averaged classification error corresponding to each tuning parameter value. The value that generated the minimal averaged error was the one we selected to fit the final model to all the data. Note that the classification error rate from the final model was likely to be biased due to the double use of the data for training/tuning and test; the main purpose of fitting the final model was to see which genes were selected in the end.
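The random split and the gene-wise normalization described above might be sketched as follows. The helper names are hypothetical, and the use of the sample standard deviation (and leaving constant genes at zero) is an assumption, since the text only states that each gene was scaled to mean 0 and standard deviation 1 across samples.

```python
import random
import statistics


def split_indices(n, n_train, n_tune, rng):
    """Randomly partition sample indices into training, tuning, and test sets."""
    idx = list(range(n))
    rng.shuffle(idx)
    return idx[:n_train], idx[n_train:n_train + n_tune], idx[n_train + n_tune:]


def standardize_columns(X):
    """Scale each gene (column) to mean 0 and standard deviation 1 across samples."""
    columns = list(zip(*X))
    scaled = []
    for col in columns:
        mean = statistics.mean(col)
        sd = statistics.stdev(col)
        # a constant gene carries no information; map it to zeros
        scaled.append([(v - mean) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*scaled)]


# 105 samples split into 40 training, 20 tuning, and 45 test observations.
train, tune, test = split_indices(105, 40, 20, random.Random(0))
print(len(train), len(tune), len(test))

# Toy expression matrix (hypothetical): 3 samples by 2 genes.
print(standardize_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]))
```

Repeating the split with different seeds gives the independent runs over which the classification error is averaged.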
First, we focused on the 1,070 genes that appeared in the network with the largest variations of expression levels (i.e., SD of expression levels across the 105 samples ≥ 15). According to the KEGG pathway of Parkinson's disease [23], 20 genes play a role in the disease progression, five of which (UBE1, PARK2, UBB, SEPT5, and SNCAIP) belong to the 1,070 genes. In addition to the classification error, we used two further criteria for method comparison: the number of disease genes identified, and the number of genes identified. Table 4 shows that the STDSVM made the most accurate classification, even though the difference from the other methods was perhaps non-significant. The w = d method ranked second in predictive performance while producing a model including 70.6 genes on average. In this case, w = $\sqrt{d}$ had an advantage: it selected more disease genes with a relatively sparse model and a classification error non-significantly larger than that of the STDSVM. From the 1,070 genes, the final model of the new method identified 75 genes, including one disease gene.
Next, to better integrate the biological observation of the KEGG pathway and the known network structure of [18], we restricted our analysis to the first- and second-order neighbors of the 8 disease genes on the Parkinson's disease KEGG pathway whose expression levels and network structure are available. The first-order-neighbor subnetwork (PD1nbnet) was composed of the 8 disease genes and their 8 direct neighbors. The second-order-neighbor subnetwork (PD2nbnet) comprised the PD1nbnet as well as the direct neighbors of the 8 direct neighbors of the disease genes, leading to a total of 26 genes. Figure 1 displays the two subnetworks. We conducted the analysis in the same way as described above; the only difference was that this time only genes appearing in the PD1nbnet and PD2nbnet were included in the analysis. Table 5 shows the results.
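Extracting first- and second-order-neighbor subnetworks of a set of seed genes from an edge list is a short breadth-first expansion. The sketch below is illustrative only: the gene labels are placeholders rather than genes from the KEGG network, and the function names are hypothetical.

```python
def adjacency(edges):
    """Map each gene to the set of its direct neighbors."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def neighbor_subnetwork(seeds, edges, order):
    """Genes within `order` steps of any seed gene, including the seeds."""
    adj = adjacency(edges)
    selected = set(seeds)
    frontier = set(seeds)
    for _ in range(order):
        # direct neighbors of the current frontier that are not yet selected
        frontier = {n for g in frontier for n in adj.get(g, ())} - selected
        selected |= frontier
    return selected


def induced_edges(genes, edges):
    """Edges of the network with both ends inside the selected gene set."""
    return [(a, b) for a, b in edges if a in genes and b in genes]


# Placeholder network with one seed ("disease") gene A.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
first = neighbor_subnetwork({"A"}, edges, order=1)
second = neighbor_subnetwork({"A"}, edges, order=2)
print(sorted(first), sorted(second))
print(induced_edges(second, edges))
```

The induced edge list of the chosen subnetwork is what defines the set S of gene pairs entering the penalty when the analysis is restricted to that subnetwork.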
We see the gains from employing the new method when narrowing our focus to the PD1nbnet and PD2nbnet. For the PD1nbnet, w = 1 and w = $\sqrt{d}$ performed equally well. They had the smallest classification error and identified one more disease gene with a model slightly larger than the one obtained from the L1SVM. The new method with w = d performed best in the case of PD2nbnet, with the best accuracy and the most selected disease genes. The w = $\sqrt{d}$ method ranked second in terms of prediction accuracy while detecting 3 more disease genes with a model containing 3 more genes than that of the L1SVM. This means that the new method was able to identify more clinically relevant genes while keeping the same number of noise genes in the model as the L1SVM. In both subnetworks, the final models included all the genes.
Breast cancer metastasis
The breast cancer metastasis data set [1, 4] contains expression levels of 8,141 genes for 286 patients, 106 of whom were found to develop metastasis within a 5-year follow-up after surgery. TP53, BRCA1, and BRCA2 are three human genes that belong to the class of tumor suppressor genes, which are known to prevent uncontrolled cell proliferation and to play a critical role in repairing chromosomal damage. Certain mutations of these genes lead to an increased risk of breast cancer. We explored the protein-protein interaction (PPI) network previously used by [1]. The PPI network comprises 57,235 interactions among 11,203 proteins, obtained by assembling various sources of experimental data and curation of the literature [1]. We confined our analysis to the direct or first-order neighbors (BC1nbnet) of the three cancer genes, and the subnetwork composed of two parts (BC2nbnet): the direct neighbors of TP53, and the second-order neighbors of BRCA1 and BRCA2. We fit the final model and compared the four methods in terms of classification error, cancer gene selection, and model sparsity. The cancer genes are the 227 known or putative cancer genes with estimated mutation frequencies in cancer samples [1]. A total of 294 genes that fell into the BC1nbnet had observed expression levels, among which were 40 cancer genes and 7 cancer genes (ABL1, JAK2, p53, PTEN, p14ARF, PTCH, and RB) with mutation frequencies larger than 0.10. The BC2nbnet was composed of 2,070 genes, 1,718 of them with observed expression levels, including 107 cancer genes. Besides the 7 included in BC1nbnet, 7 additional cancer genes (ACH, APC, EGFR, KIT, NICD, RAS, and CTNNB1) that had mutation frequencies larger than 0.10 belonged to BC2nbnet.
For BC1nbnet, w = d had the advantage in selecting cancer genes and those with large mutation frequencies (Table 6). The w = $\sqrt{d}$ method detected more clinically relevant genes with a sparser model while reaching a classification error rate comparable to that of the L1SVM. Even though the final model was parsimonious, it included 4 cancer genes, one of which had a large mutation frequency. For BC2nbnet, the new method with w = $\sqrt{d}$ detected more cancer genes with equally accurate predictions while maintaining a sparse model compared with the L1SVM. The final model included only 23 genes out of 1,718, two of which were cancer genes, with one having a large mutation frequency.
Conclusion
Advances in microarray technology have enriched the tool kit of researchers seeking to decipher the complexity of disease mechanisms at the genomic level. Studies have been widely conducted to identify genetic markers to improve diagnostic classification and prognostic assessment, largely by ignoring biological knowledge on gene functions and treating individual genes equally and independently a priori. The downside of such an endeavor has been realized; for example, gene markers identified in this way across similar patient cohorts for the same disease often lack consistency. As a viable alternative, the network-based approach has been gaining popularity. In addition to improving predictive performance and gene selection, the network-based approach extracts more biological insights from high-throughput gene expression data. Here we have proposed a network-based SVM, with a penalty term incorporating gene network information, as a practically useful classification tool for microarray data. Our simulation studies and two real data applications indicate that the proposed method is able to better identify clinically relevant genes and make accurate predictions.
Abbreviations
SVM: support vector machine
STDSVM: standard support vector machine
L1SVM: L_{1}-penalized support vector machine
LP: linear programming
PD: Parkinson's disease
BC: breast cancer
KEGG: Kyoto Encyclopedia of Genes and Genomes
PPI: protein-protein interaction
References
 1.
Chuang HY, Lee EJ, Liu YT, Lee DH, Ideker T: Network-based classification of breast cancer metastasis. Mol Syst Biol 2007, 3: 140. 10.1038/msb4100180
 2.
Frolov AE, Godwin AK, Favorova OO: Differential gene expression analysis by DNA microarray technology and its application in molecular oncology. Mol Biol 2003, 37: 486–494. 10.1023/A:1025166706481
 3.
Yang TY: The simple classification of multiple cancer types using a small number of significant genes. Mol Diagn Ther 2007, 11: 265–275.
 4.
Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005, 365: 671–679. 10.1016/S0140-6736(05)17947-1
 5.
Xiong MM, Li WJ, Zhao JY, Li J, Boerwinkle E: Feature (gene) selection in gene expression-based tumor classification. Mol Genet Metab 2001, 73: 239–247. 10.1006/mgme.2001.3193
 6.
Ein-Dor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in breast cancer: is there a unique set? Bioinformatics 2005, 21: 171–178. 10.1093/bioinformatics/bth469
 7.
Ein-Dor L, Zuk O, Domany E: Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc Natl Acad Sci USA 2006, 103: 5923–5928. 10.1073/pnas.0601231103
 8.
Liu M, Liberzon A, Kong SW, Lai WR, Park PJ, Kohane IS, Kasif S: Network-based analysis of affected biological processes in type 2 diabetes models. PLoS Genet 2007, 3: e96. 10.1371/journal.pgen.0030096
 9.
Cortes C, Vapnik V: Support-vector networks. Machine Learning 1995, 20: 273–297.
 10.
Vapnik V: The Nature of Statistical Learning Theory. New York: Springer; 1995.
 11.
Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA 2000, 97: 262–267. 10.1073/pnas.97.1.262
 12.
Furey T, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16: 906–914. 10.1093/bioinformatics/16.10.906
 13.
Zou H, Yuan M: The F_{∞}-norm Support Vector Machine. Stat Sin 2008, 18: 379–398.
 14.
Wahba G, Lin Y, Zhang H: GACV for support vector machines. In Advances in Large Margin Classifiers. Edited by: Smola A, Bartlett P, Scholkopf B, Schuurmans D. Cambridge, MA: MIT Press; 2000:297–311.
 15.
Hastie T, Tibshirani R, Friedman JH: The Elements of Statistical Learning. New York: Springer; 2001.
 16.
Friedman JH, Hastie T, Rosset S, Tibshirani R, Zhu J: Discussion of boosting papers. Ann Stat 2004, 32: 102–107.
 17.
Pan W, Xie B, Shen X: Incorporating predictor network in penalized regression with application to microarray data. [Manuscript submitted].
 18.
Li C, Li H: Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics 2008, 24: 1175–1182. 10.1093/bioinformatics/btn081
 19.
Zou H, Hastie T: Regularization and variable selection via the elastic net. J R Statist Soc B 2005, 67: 301–320. 10.1111/j.1467-9868.2005.00503.x
 20.
Wang L, Zhu J, Zou H: The doubly regularized support vector machine. Stat Sin 2006, 16: 589–615.
 21.
Gene Expression Omnibus: GSE6613 [http://www.ncbi.nlm.nih.gov/geo/]
 22.
Scherzer CR, Eklund AC, Morse LJ, Liao Z, Locascio JJ, Fefer D, Schwarzschild MA, Schlossmacher MG, Hauser MA, Vance JM, Sudarsky LR, Standaert DG, Growdon JH, Jensen RV, Gullans SR: Molecular markers of early Parkinson's disease based on gene expression in blood. Proc Natl Acad Sci USA 2007, 104: 955–960. 10.1073/pnas.0610204104
 23.
KEGG: Parkinson's disease [http://cgap.nci.nih.gov/Pathways/Kegg/hsa05020]
Acknowledgements
YZ and WP were partially supported by NIH grants HL65462 and GM081535; XS was supported by NIH grant GM081535 and NSF grants IIS-0328802 and DMS-0604394. We thank Dr Hongzhe Li and Dr Trey Ideker for providing the KEGG network and PPI network data, respectively.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
YZ implemented the methods, did all the experiments and drafted the paper. XS and WP initiated the project. All participated in the writing of the article.
Cite this article
Zhu, Y., Shen, X. & Pan, W. Network-based support vector machine for classification of microarray samples. BMC Bioinformatics 10, S21 (2009). https://doi.org/10.1186/1471-2105-10-S1-S21
Keywords
 Support Vector Machine
 Cancer Gene
 Classification Error
 Neighboring Gene
 Heavy Weight