Network-based support vector machine for classification of microarray samples

Background The importance of network-based approach to identifying biological markers for diagnostic classification and prognostic assessment in the context of microarray data has been increasingly recognized. To our knowledge, there have been few, if any, statistical tools that explicitly incorporate the prior information of gene networks into classifier building. The main idea of this paper is to take full advantage of the biological observation that neighboring genes in a network tend to function together in biological processes and to embed this information into a formal statistical framework. Results We propose a network-based support vector machine for binary classification problems by constructing a penalty term from the F∞-norm being applied to pairwise gene neighbors with the hope to improve predictive performance and gene selection. Simulation studies in both low- and high-dimensional data settings as well as two real microarray applications indicate that the proposed method is able to identify more clinically relevant genes while maintaining a sparse model with either similar or higher prediction accuracy compared with the standard and the L1 penalized support vector machines. Conclusion The proposed network-based support vector machine has the potential to be a practically useful classification tool for microarrays and other high-dimensional data.


Background
The past two decades have witnessed rapid advances in gene expression profiling with the microarray technology, which not only brighten the prospect of deciphering the complexity of disease genesis and progression at the genomic level, but also revolutionize the diagnostic, therapeutic, and prognostic approaches. Up to recently, diag-nostic classification and prognostic assessment have been based on conventional clinical and pathological risk factors, such as patient age and tumor size, many of which are believed to be secondary manifestation [1]. The advent of microarray technology allows researchers to explore primary disease mechanisms by comparing gene expression profiles for malignant and normal cells. The regular-ity and aberration in the expression patterns of certain genes shed light on their functions and pathological importance [2]. Studies that seek to identify gene markers to refine diagnostic classification and improve prognostic prediction in the context of gene expression data have enriched the literature [3][4][5]. In recent years, researchers have realized that gene markers identified from microarrays drawn from difierent studies on the same disease across similar cohorts lack consistency [6,7]. A possibly more effective means to resolve this problem is to employ a network-based approach, that is, to identify markers as gene subnetworks, defined as groups of functionally related genes based on a gene network, instead of treating individual genes as completely independent and identical a priori as in most existing approaches [1]. A novel network-based approach proposed recently [1,8] can be summarized as follows: (1) randomly searching subnetworks and assigning a score to each subnetwork that characterizes the subnetwork-wise gene expression level; (2) identifying significant subnetworks that can well discriminate the clinical outcome; (3) constructing a classifier based on the significant subnetworks with a conventional statistical tool, such as logistic regression. Essentially such a network-based approach aggregates gene expression data at the subnetwork level and then identifies and utilizes some significant subnetworks. It has been shown that such a network-based approach not only improves predictive performance and reproducibility, but also sheds biological insights into molecular mechanisms underlying the clinical outcome. However, the above method is largely heuristic without a formal statistical framework; more importantly, it involves a random search over subnetworks, leading to possibly different results from different runs with no guarantee of the optimality of the final result. Because of the ever-increasing popularity of penalization methods for high-dimensional data, we propose a novel network-based penalty to be used with the hinge loss, leading to a network-based support vector machine. While maintaining some desirable properties of support vector machine (SVM) with the hinge loss function, the network-based penalty directly integrates a biological network to realize more effective variable selection, as compared with generic methods, such as the standard SVM (STD-SVM) or L 1 -penalized SVM (L1-SVM).
The support vector machine (SVM) is one of the most popular supervised learning techniques with wide-ranging applications [9,10]. In particular, previous studies have demonstrated its superior performance in gene expression data analysis, especially its ability to handle high dimensional data [11,12]. Nevertheless, with categorical predictors, both the STD-SVM and the L1-SVM may have some shortcomings. Zou and Yuan [13] applied the concept of grouped variable selection and developed an F ∞ -norm penalized SVM to realize simultaneous selec-tion/elimination of all the features derived from the same categorical factor (or a group of variables). Their numerical examples showed that the F ∞ -norm SVM outperformed the L1-SVM in factor-wise variable selection. We extend the idea of variable grouping to gene networks: rather than grouping all the dummy variables created from the same categorical factor, we treat two neighboring genes in a network as one group. The network-based penalty is constructed as the sum of the F ∞ -norms being applied to the groups of neighboring-gene pairs. With the hinge loss penalized by such a network-based penalty as our objective function, we obtain our network-based SVM. The later sections are organized as follows. We begin with a brief review of the SVM, and then introduce our proposed network-based SVM. We evaluate its performance by simulation studies in both low dimensional and high dimensional data settings as well as two real data applications. The last section concludes the paper with a brief summary.

Existing methods
Suppose we have training data with SVM searches for such a hyperplane that maximizes the margin between the training data points for class 1 and class -1: where ξ i are slack variables, and C is a tuning parameter to be determined. The STD-SVM has an equivalent hinge loss + penalty formulation as an optimization problem [13][14][15]: where the subscript "+" denotes the positive part, i.e., z + = max{z, 0}, , and λ is the tuning parameter. The solution to (1) is the same as that to (2).
The above STD-SVM forces all nonzero coefficient estimates, which leads to the problem of its inability to conduct variable selection. The L1-SVM was proposed to accomplish the goal of variable selection. It can be formulated as where . The L1-SVM wins over the STD-SVM when the true model is sparse, while the STD-SVM is preferred if there are not many redundant noise features [16].
Zou and Yuan [13] pointed out the shortcoming of the L 1norm penalty: even though it encourages parsimonious models, it fails to guarantee successful models in cases of categorical predictors due to the fact that each dummy variable is selected independently. They applied the concept of grouped variable selection and proposed an F ∞norm SVM to realize simultaneous selection/elimination of features derived from the same factor so as to accomplish automatic factor-wise variable selection. Suppose we have G factors F 1 ,...,F G . From each factor F g , we generate a feature vector .
Correspondingly we have the coefficient vector . Therefore, The F ∞ -norm SVM is formulated as The most noteworthy property of the F ∞ -norm SVM is its guarantee of sparsity at the factor level. Due to the singularity property of the infinity norm: || β (g) || ∞ is not differentiable at β (g) = 0, β (g) will be exactly zero if the regularization parameter λ is properly chosen [13]. Therefore, the F ∞ -norm SVM automatically eliminates factors that are completely irrelevant to the response, and thus achieves the goal of factor-wise selection. The empirical evidence shows that the F ∞ -norm SVM often outperforms both the L1-SVM and the STD-SVM.

New method
Biological observations reveal that neighboring genes in a network tend to function together in biological processes.
To incorporate this prior information, a network-based SVM for binary classification is proposed to facilitate generating models that extract more biological insight from gene expression data. The penalty term that characterizes the network structure can be specified by implanting the F ∞ -norm into the context of known functional interrelationships among genes by considering each pair of the functionally related genes as one group.
Consider a gene network with S denoting the set of all edges, i.e., the pair of connected genes. Four properties of the penalty term are noteworthy. First, the regularization is performed at the level of grouped genes with each group containing two neighboring genes in the network. In the case of penalized linear regression, it has been proven that this penalty achieves the goal of eliminating both and simultaneously if (j 1 , j 2 ) ∈ S [17]. The automatic selection of grouped features is due to the singularity of function max{|a|, |b|} [13]. This formulation satisfies our assumption that neighboring genes tend to (or not to) contribute to the same biological process at the same time. Second, the choice of the weight depends on the goal of shrinkage and influences the predictive performance. Consider a network comprised of several subnetworks, each with one regulator and ten target genes. Because of the singularity of function max(|a|, |b|) at a = b, the weighted penalty in the context of penalized regression, encourages [17].
Here we examine three weight functions in particular: w k favor genes with more direct neighbors to have larger coefficient estimates; in other words, heavier weights relax the shrinkage effect for those regulators, which are known to be biologically more important. Due to this property, the choice of a heavy weight, as a simple strategy, enables us to alleviate the bias in the coefficient estimates from the penalization method and possibly improve the p predictive performance. Our default weight is w k = . The weight, considered as another tuning parameter, can be determined from cross-validation or an independent validation data set, though we do not consider it here. Third, the penalty term, under certain conditions, tends to encourage a grouping effect, where highly correlated predictors tend to have similar coefficient estimates [17][18][19][20]. Fourth, the penalty is linear, which allows the solution to be found by the linear programming (LP) technique that is computationally convenient.
As usual, the fitted classifier is , and the classification rule is sign( (x)). We employ LP to obtain the solutions to (8)

Simulation
We conducted several simulation studies to numerically evaluate the performance of the network-based SVM along with the STD-SVM and L1-SVM. The simulation setups were similar to those in [18]. We started from a simple network consisting of 5 subnetworks, each having a regulator gene t (t = 1,...,5) that regulated 10 target genes, leading to a total of 55 genes (p = 55). We assumed that two out of the five subnetworks were informative; that is, the coefficients of 22 genes were nonzero and thus informative to the outcome, while the remaining 33 noise genes had no effect on the outcome. We generated a simulated data set by the following steps: • Generate the expression level of regulator gene t, X t ~ N (0, 1), t = 1,..., 5, independently. β β j j j j w w • Generate the outcome Y from a logistic regression model: Logit (Pr(Y = 1|X)) = X T β + β 0 , β 0 = 2, where X is the vector of the expression levels of all the genes, and coefficient vector .
Four sets of true coefficients, β 's, were specified to reflect four scenarios: 1. .
The effect of one informative subnetwork was the same as the other in magnitude but with an opposite direction.

.
Both informative subnetworks had positive effects but in different magnitudes.

3.
Target genes in the same informative subnetworks had both positive and negative effects.

4.
It was similar to but more extreme than scenario 3.   Table 1 reports the mean classification error of the test set and its standard error (SE in parentheses), the standard deviation of the classification errors divided by the square root of the number of runs, for each method over 100 runs under each scenario. To evaluate each method's ability to select informative genes, we examined the false negatives, defined as the number of informative genes whose coefficients were estimated to be zero. In addition, we also considered a smaller sample size: we repeated the entire process with 50 training data points, 50 tuning data points, and again 10,000 test data points. The network-based SVM is named as "New" in the table.
According to our simulation setups, the correct weight function should be w = . However, we find that the new method with w = d overwhelmingly beat all other methods in all the setups. It consistently made the most accurate classifications and missed no informative genes.
The new method with w = performed the second best: in most cases, it improved the classification accuracy over STD-SVM and L1-SVM; and under all the settings, it produced models that identified more informative genes than the L1-SVM. In contrast, w = 1 did not bring much gains over the STD-SVM or the L1-SVM. The L1-SVM led to models that were too sparse, missing about 14 and 11 informative genes for n = 50 and n = 100 respectively. The superior performance and the larger model size of the heavy weight (w = d) compared with its counterparts (w = 1 and w = ) is presumably due to its relaxation of the shrinkage effect. The penalization methods shrink the toward zero by imposing the constraints (the penalty term) and therefore introduces bias to . By grouping neighboring genes, the new method encourages the pairwise weighted absolute coefficients to be equal. Therefore, a heavy weight leads to larger | | for regulator genes. By choosing a heavier weight, we may overcome over-shrinkage, alleviate biases, and achieve better classification accu-racy to some extent at the expense of model sparsity. As shown by Table 2, w = d produced the largest | | for regulators than its two counterparts. The L1-SVM estimates were treated as a yardstick for comparison as to provide an idea of the extent of shrinkage by each weight function.
For example, w = 1 and w = overly shrank all the regulators under all scenarios as compared with the L1-SVM estimates. Note that the binary outcome Y was generated from a logistic regression model while was estimated from a linear model, hence E( ) may be different from β even for an unbiased estimator of the linear model.
Next, we evaluated the performance of the new method for high-dimensional data with large p. We used the setup of 50 observations for training, 50 for tuning, and 10,000 for test data. We assumed that (1) the network was composed of either 50 or 100 subnetworks, each having one gene regulating 10 target genes; (2) the first 2 subnetworks were informative resulting in 22 informative genes; (3) the rest of the genes had no effect on the outcome, leading to 528 noise genes when p = 550 and 1,078 noise genes when p = 1, 100; and (4) the true β was specified as in scenario 3. Table 3 shows the simulation results averaged over 100 runs. Again, we see the gains from using a heavy weight (w = d). It prevailed over all the other methods in making accurate classifications and selecting informative genes. The w = ranked the second. However, w = d generated models much larger than those from other methods except STD-SVM. In this case, the performance of w = 1 is no better than L1-SVM possibly due to over shrinkage of the effects of the regulator genes.

Applications to microarray data
To evaluate its performance in the real world, we applied the new method to two microarray gene expression data sets related to the Parkinson's disease (PD) [21] and breast cancer metastasis (BC) [1,4] respectively.

Parkinson's disease
The data set includes the Parkinson's disease status and the expression levels of 22,283 genes from 105 patients (50 cases and 55 controls) [22]. We used the same network structure as [18].   on which the classification errors were calculated for wide-ranging values of the tuning parameter. Then after 10 runs, we had an averaged classification error corresponding to each tuning parameter value. The value that generated the minimal averaged error was the one we selected to fit the final model to all the data. Note that the classification error rate from the final model was likely to be biased due to the double use of the data for training/ tuning and test; the main purpose of fitting the final model was to see the selected genes at the end.
First, we focused on the 1,070 genes that appeared in the network with the largest variations of expression levels (i.e., SD of expression levels across the 105 samples ≥ 15). According to the KEGG pathway of Parkinson's disease [23], 20 genes play a role in the disease progression, five of which (UBE1, PARK2, UBB, SEPT5, and SNCAIP) belong to the 1,070 genes. In addition to the classification error, we added two additional criteria for method comparison: the number of disease genes identified, and the number of genes identified. Table 4 shows that STD-SVM made the most accurate classification, even though the difference with other methods was perhaps non-significant. The w = d ranked the second in predictive perform-ance while produced a model including 70.6 genes on average. In this case, the w = gained advantage: it selected more disease genes by a relatively sparse model with a classification error non-significantly larger than STD-SVM. From the 1,070 genes, with the final model the new method identified 75 genes including one disease gene.
Next, to better integrate the biological observation of the KEGG pathway and the known network structure of [18], we restricted our analysis to the first-and second-orderneighbors of the 8 disease genes on the Parkinson's disease KEGG pathway whose expression levels and network structure are available. The first-order-neighbor subnetwork (PD-1nb-net) was composed of the 8 disease genes and their 8 direct neighbors. The second-order-neighbor subnetwork (PD-2nb-net) comprised the PD-1nb-net as well as the direct neighbors of the 8 direct neighbors of the disease genes, leading to a total of 26 genes. Figure 1 displays the two subnetworks. We conducted the analysis in the same way as described above. The only difference resided in that this time only genes appearing in the PD-1nb-net and PD-2nb-net were included in the analysis. Table 5 shows the results.
We see the gains from employing the new method when narrowing down our focus on the PD-1nb-net and PD-2nb-net. For the PD-1nb-net, w = 1 and w = performed equally well. They had the smallest classification error and identified one more disease gene through a model slightly larger than the one obtained from L1-SVM.
The new method with w = d won over in the case of PD-2nb-net with the best accuracy and most selected disease genes. The w = ranked the second in terms of the prediction accuracy while detecting 3 more disease genes by a model with 3 more genes than that of the L1-SVM. This means that the new method was able to identify more clinically relevant genes while keeping the same number of noise genes in the model as L1-SVM. In both subnetworks, the final models included all the genes.   d

Breast cancer metastasis
The breast cancer metastasis data set [1,4] contains expression levels of 8,141 genes for 286 patients, 106 of whom were detected to develop metastasis within a 5-year follow-up after surgery. TP53, BRCA1, and BRCA2 are three human genes that belong to the class of tumor suppressor genes, which are known to prevent uncontrolled cell proliferation, and to play a critical role in repairing the chromosomal damage. Certain mutations of these genes lead to increasing risk of breast cancer. We explored the pro-  tein-protein interaction (PPI) network previously used by [1]. The PPI network comprises 57,235 interactions among 11,203 proteins, obtained by assembling various sources of experimental data and curation of the literature [1]. We confined our analysis to the direct or first-order neighbors (BC-1nb-net) of the three cancer genes, and the subnetwork composed of two parts (BC-2nb-net): the direct neighbors of TP53, and the second-order neighbors of BRCA1 and BRCA2. We fit the final model and compared the four methods in terms of classification error, cancer genes selection, and model sparsity. The cancer genes are the 227 known or putative cancer genes with estimated mutation frequencies in cancer samples ( [1]). A total of 294 genes that fell into the BC-1nb-net had observed expression levels, among which were 40 cancer genes and 7 cancer genes (ABL1, JAK2, p53, PTEN, p14ARF, PTCH, and RB) with mutation frequencies larger than 0.10. The BC-2nb-net was composed of 2,070 genes, 1,718 of them with observed expression levels, including 107 cancer genes. Besides the 7 included in BC-1nb-net, 7 additional cancer genes (ACH, APC, EGFR, KIT, NICD, RAS, and CTNNB1) that had mutation frequencies larger than 0.10 belonged to BC-2nb-net.
For BC-1nb-net, w = d had the advantage in selecting cancer genes and those with large mutant frequencies ( Table   6). The w = detected more clinically relevant genes by a sparser model while reaching a comparable classification error rate to that of L1-SVM. Even though the final model was parsimonious, it included 4 cancer genes, one of which had a large mutation frequency. For BC-2nb-net, the new method with w = detected more cancer genes with equally accurate predictions while maintaining a sparse model compared with L1-SVM. The final model included only 23 genes out of 1,718, two of which were cancer genes with one having a large mutation frequency.

Conclusion
The advancement in the microarray technology has enriched the tool kit of researchers to decipher the complexity of disease mechanisms at the genomic level. Studies have been widely conducted to identify genetic markers to better the diagnostic classification and prognostic assessment, largely by ignoring biological knowledge on gene functions and treating individual genes equally and independently a priori. The downside of such an endeavor has been realized; for example, gene markers identified across similar patient cohorts for the same disease in such a way often lack consistency. As a viable alternative, the network-based approach has been gaining popularity. In addition to improving predictive performance and gene selection, the network-based approach extracts more biological insights from high-throughput gene expression data. Here we have proposed a networkbased SVM, with a penalty term incorporating gene network information, as a practically useful classification tool for microarray data. Our simulation studies and two real data applications indicate that the proposed method is able to better identify clinically relevant genes and make accurate predictions.