 Methodology article
 Open access
 Published:
Gene and pathway identification with L_{ p }penalized Bayesian logistic regression
BMC Bioinformatics volume 9, Article number: 412 (2008)
Abstract
Background
Identifying genes and pathways associated with diseases such as cancer has been a subject of considerable research in recent years in the area of bioinformatics and computational biology. It has been demonstrated that the magnitude of differential expression does not necessarily indicate biological significance. Even a very small change in the expression of particular gene may have dramatic physiological consequences if the protein encoded by this gene plays a catalytic role in a specific cell function. Moreover, highly correlated genes may function together on the same pathway biologically. Finally, in sparse logistic regression with L_{ p }(p < 1) penalty, the degree of the sparsity obtained is determined by the value of the regularization parameter. Usually this parameter must be carefully tuned through crossvalidation, which is time consuming.
Results
In this paper, we proposed a simple Bayesian approach to integrate the regularization parameter out analytically using a new prior. Therefore, there is no longer a need for parameter selection, as it is eliminated entirely from the model. The proposed algorithm (BLpLog) is typically two or three orders of magnitude faster than the original algorithm and free from bias in performance estimation. We also define a novel similarity measure and develop an integrated algorithm to hunt the regulatory genes with low expression changes but having high correlation with the selected genes. Pathways of those correlated genes were identified with DAVID http://david.abcc.ncifcrf.gov/.
Conclusion
Experimental results with gene expression data demonstrate that the proposed methods can be utilized to identify important genes and pathways that are related to cancer and build a parsimonious model for future patient predictions.
Background
Gene selection and cancer prediction with microarray data have been studied extensively in recent years. Most earlier studies concentrated on identifying a small number of discriminatory genes with different statistical and machine learning methods [1–3]. Many statistical learning techniques such as support vector machines [4], the relevance vector machines (RVM) [5, 6], LASSO [7–9], and sparse logistic regression [10–12] have been applied to this problem. There are two common goals for such algorithms: The first is to distinguish cancer and noncancer patients with the highest possible accuracy. The second is to identify a small subset of genes that are highly differentiated in different classes and to associate gene expression patterns with disease status. The genes identified with the second aim may improve our understanding of the underlying causes of the cancer. In gene selection, when genes share the same biological pathway, the correlation between them can be high [13], and those genes form a group. The ideal gene selection methods eliminate the trivial genes and automatically include the whole group genes into the model once one gene among them is selected. Most importantly, almost all of the current methods are biased towards selecting those genes that display the most pronounced expression differences. Such methods select genes using purely statistical criteria (either rank score or classification accuracy) and this selection is thought to reflect their relative importance. Quite often, a certain number of genes with the smallest pvalues or highest prediction accuracy are finally selected, while most biologists recognize that the magnitude of differential expression does not necessarily indicate biological significance. From the biological prospective, even a very small change in the expression of a particular gene may have dramatic physiological consequences if the protein encoded by this gene plays a catalytic role in a specific cell function [14]. Many other downstream genes may amplify the signal produced by this truly interesting gene, thereby increasing their chance of being selected by current gene selection methods. For a regulatory gene, however, the chance of being selected by such methods may diminish as one keeps hunting for downstream genes that tend to show much bigger changes in their expression. As a result, the initial list of candidate genes may be enriched with many effector genes that do little to elucidate more fundamental mechanisms of biological processes. Therefore we have to deal with two important problems in gene selection: (1) how to take into account the genegene correlations and (2) how to hunt the upstream regulatory genes. The characteristic of the regulatory genes is that their gene expression changes may be low, but they are highly correlated with the downstream highly expressed genes. Although there is ongoing research to incorporate prior biological knowledge, such as partially known pathways in gene selection [15], to the best of our knowledge, there is no efficient method to hunt the upstream regulatory genes in gene selection and pathway discovery. There is, therefore, a pressing need for new algorithms to be developed.
In this paper, we propose a substantial improvement to the sparse logistic regression (SparseLOGREG) approach [12]. The SparseLOGREG algorithm employs a L_{ p }norm regularization [16], which is equivalent to super Laplace prior over the model parameters. Both the generalization ability of the model and the sparsity achieved are critically dependent on the value of a regularized parameter, which has to be carefully tuned to the best performance. This best parameter can only be found through crossvalidation and computationally intensive search. In this paper, the regularization parameter, however, will be integrated out analytically using a new prior that is similar to the uninformative Jeffery's prior [11]. The resulting algorithm (BLpLog) has the comparable performance with the original algorithm but is much faster, as there is no longer a need for parameter optimization. The goal of the current study is to develop a computationally affordable and wellbehaved estimating approach, which can effectively identify cancer related genes and pathways. We propose an integrated method that first identifies a small subset of cancer related genes utilizing the L_{ p }regularized Bayesian logistic regression (BLpLog), and then define a novel similarity measure to identify the regularized genes that are highly correlated with each gene in the subset. Finally, we annotate the regularized genes and identify the cancer related pathways using DAVID.
Results and discussion
In this section, we evaluate the performance of proposed L_{ p }penalized Bayesian logistic regression (BLpLog) methods and the integrated algorithm using several microarry data. We compare proposed method with SparseLOGREG [12] and BLogReg [11].
SparseLOGREG includes a regularization parameter, controlling the complexity of the model and the sparsity of the model parameters, which must be chosen by the user or alternatively optimized in an additional model selection stage. Therefore, the value of this parameter is found via a (computationally expensive) maximization of the crossvalidation estimate of the area under the ROC curve (AUC). However, we cannot use the same crossvalidation estimate for both model selection and performance evaluation as this would introduce a strong selection bias in favor of the existing sparse SparseLOGREG model. A nested crossvalidation procedure is therefore used instead. 10fold cross validation is used for performance evaluation in the 'outer loop' of the procedure, in each iteration of which model selection is performed individually for each classifier based on a separate leaveoneout crossvalidation procedure using the training data only. Because of the small sample size and high dimensional genes, leaveoneout cross validation in the 'inner loop' likely provide a reliable performance measure for model selection. Even though this nested crossvalidation is computationally expensive, it provides an almost unbiased assessment of generalization performance as well as a sensible automatic method of setting the value of the regularization parameter. We do not need model selection with both BLpLog and BLogReg, only 10fold cross validation is used for performance evaluation. Finally, we find the highly correlated upstream genes with proposed correlation measure and identify the related pathways using DAVID.
Breast Cancer Data Set [17]
98 primary breast cancers (34 from patients who developed distant metastases within 5 years, 44 from patients who continue to be diseasefree after a period of at least 5 years, 18 from patients with BRCA1 germline mutations and 2 from BRCA2 carriers) have been selected from patients who were lymph node negative and under 55 years of age at diagnosis. There is a total of 24188 genes. This data set contained some missing values. Gene expression levels lacking for all patients are left out. The rest of the missing values are estimated based on the correlations between gene expressions.
We apply the proposed integrated algorithm to the data. BLpLog identifies 11 genes with the 10crossvalidation AUC = 0.976. These 11 genes are highly differentiated in patients with and without metastases. SparseLOGREG selects 10 highly differentiated genes with predicted AUC = 0.981, and BLogReg identifies 14 genes with the predicted AUC = 0.953. Both SparseLOGREG and BLpLog outperform BLogReg with higher AUC value and less genes. but that the difference in performance between the SparseLOGREG and BLpReg algorithms is minimal. The BLpLog algorithm is marginally more computational expensive than the BLogReg with multiple initializations. It takes 5 minutes compare 1.7 minutes with BLogReg algorithm. The SparseLOGREG algorithm is very much more expensive, owing to the need for a model selection stage to choose a good value for the regularization parameter. It takes roughly 4 hours on the same PC. Given the minimal difference in performance and substantial difference in computational expense there is little reason to prefer the SparseLOGREG over the BLpLog algorithm. We find the correlated genes for each of the 11 selected genes using the criteria R > 0.9 and identify pathways that associated with those genes with DAVID. The 11 genes with BLpLog are listed in Table 1. Each pathway is identified in such a way that the statistical significance of the pathway is the highest (p value is the smallest) in DAVID. The '+' and '' signs in column 2 of Table 1 indicate that the selected gene is either overexpressed or downexpressed for patients with metastases. The total number of highly correlated genes with each selected gene is given in column 4. The highly correlated genes on a KEGG pathway are shown in Table 2. Six pathways associated with 7 selected genes are identified. The correlated genes of the 4 other selected do not have a KEGG pathway associated with them. The plots of T cell receiptor signaling and MAPK signaling pathway are shown in Figure 1 – Figure 2. The overexpressed and and downexpressed genes on the pathway are shown in red and blue respectively. Each of these six pathways plays an important role in breast cancer survivals. For instance, JAK/STAT signaling pathway is the principal signaling mechanism for a wide array of cytokines and growth factors. JAK activation stimulates cell proliferation, differentiation, cell migration and apoptosis. These cellular events are critical to hematopoiesis, immune development, mammary gland development and lactation, adipogenesis, sexually dimorphic growth and other processes. Predictably, mutations that reduce JAK/STAT pathway activity affect these processes. Conversely, mutations that constitutively activate or fail to regulate JAK signaling properly cause inflammatory disease, erythrocytosis, gigantism and different cancers. Moreover, LEUKOCYTE TRANSENDOTHELIAL MIGRATION provides relevant information about how cells interact with the endothelium and transmigrate. Transendothelial migration of cancer cells from the vasculature into tissue stroma is a final step in the metastatic cascade, prior to formation of secondary tumors. Patients who developed distant metastases in less than 5 years and those who had no distant metastases have 8 genes differentially expressed. The proposed integrated algorithm provides information not only about the set of genes involved on these pathways, but also about how genes interact and regulate each other. In this manuscript, we will only discuss both T CELL RECEPTOR SIGNALING PATHWAY and MAPK SIGNALING PATHWAY in more detail. Other pathways can be analyzed in a similar fashion.
T Cell Receptor (TCR) Signaling (Figure 1) induces activation of multiple tyrosine kinases, resulting in the phosphorylation of numerous intracellular substrates. One of the first steps in the generation of the immune response is the recognition by T lymphocytes of peptide fragments (antigens) derived from foreign pathogens that are presented on the surface of antigen presenting cells (APC). This event is mediated by the T cell receptor (TCR), which transduces these extracellular signals by initiating a wide array of intracellular signaling pathways. This signaling pathway is one of the identified targets for breast cancer drug development. We identified 5 genes on the pathway: CBL, VAV, MALT1, CHUK (IKKα), and IL2. VAV2, MALT1, and CHUK(IKKα) are also on the Bcell receiptor signal pathway. Among them, only gene VAV2 is downexpressed in patients with distant metastases in less than 5 years. The other 4 genes are upexpressed. VAV2 is an oncogene and plays a critical role in hematopoietic signal transduction. The downexpressed Vav2 has been implicated in breast cancer metastasis and may prove to be very important in the aberrant activation of Rho GTPases during the metastatic cascade. The other 4 overexpressed genes are also very important for breast and other cancers and were well studied in the literature. For example, the CBL oncogene was first identified as part of a transforming retrovirus which induces mouse preB and proB cell lymphomas. As an adaptor protein for receptor proteintyrosine kinases, it positively regulates receptor proteintyrosine kinase ubiquitination in a manner dependent upon its variant SH2 and RING finger domains. Ubiquitination of receptor proteintyrosine kinases terminates signaling by marking active receptors for degradation. MALT1, CHUK, and IL2 are also important oncogene identified. These 3 genes have the causal relations as shown in Figure 1. MALT1 is defined as the mucosa associated lymphoid tissue lymphoma translocation gene 1. The overexpressed MALT2 in patients with distant metastasis causes the overexpressed CHUK, and then the overexpressed IL2. Therefore, MALT1 is essential for T cell activation, proliferation, and IL2 production. If MALT1 is not present, both CHUK and IL2 will shutoff.
The MitogenActivated Protein Kinase (MAPK) signaling pathway (Figure 2) transduces a large variety of external signals, leading to a wide range of cellular responses, including growth, differentiation, inflammation and apoptosis. The MAPK signaling pathway has been linked to being responsible for the malignant phenotype, including increased proliferation, defects in apoptosis, invasiveness and ability to induce neovascularization. Consequently, different therapies towards inhibiting the pathway are under development. Nine genes were identified on the pathway. Patients with metastases in less than 5 years are overexpressed in 8 genes and downexpressed in one gene (Table 2 and Figure 2).
There are several causal relations among them. For instance, EGFR belong to the family of epidermal growth factor receptors and has been proven to play major roles in different histological types of breast cancer. The overexpressed EGFR in patients with metastases may be responsible for the upexpressed STMN1 and the downexpressed ELK4 (Sap1a). ELK4 is a downstream gene on the MAPK pathway. Moreover, the the overexpressed MAP2K3 (MKK3) causes the overexpressed MAPK12(P38), and then causes the downexpressed ELK4 (Sap1a). The systematic review of the interactions among the correlated genes on a specific pathway provides us more information about how various genes interact with each other and which gene plays a catalytic role and is more important. EGFR3 is certainly a more important upstream gene and mutations that lead to EGFR overexpression (or overactivity) have been associated with a number of cancers. The overexpressed EGFR in patients with metastases has led to the development of anticancer therapeutics directed against EGFR.
Hepatocellular carcinoma data set [18]
mRNA expression profiles in tissue specimens from 60 hepatocellular carcinoma tissues of which 20 suffer from early intrahepatic recurrence and 40 do not. The number of gene expression levels is 7129. Since hepatocellular carcinoma has a poor prognosis because of the high intrahepatic recurrence rate, the original goal is to predict early intrahepatic recurrence or nonrecurrence. With the proposed integrated algorithm, we can identify not only the highly differentiated genes but also the related pathways.
BLpLog identifies 8 highly differentiated genes with the test AUC = 0.93 with the computational time of 3.2 minutes. BLogReg selects 13 genes with the predicted AUC = 0.90 and computational time of 1.6 minutes. SparseLOGREG identifies 10 genes with the predicted AUC = 0.936 and computational time of 127 minutes (2 hours). The selected genes with different methods are not completely the same but highly correlated. Again with the minimal difference in performance and big differences in computational time between BLpLog and SparseLOGREG, obviously BLpLog is preferred. The highly correlated genes and corresponding pathways are selected with the integrated algorithm. Table 3 and Table 4 are the computational results. Seven pathways was identified from the data. The plots of Antigen Processing and Presentation and Axon Guidance pathways are shown in Figure 3 – Figure 4. All eight pathways identified are important in hepatocellular carcinoma and other cancers. For example, PURINE METABOLISM pathway is one of the metabolism pathways involved in nucleotide synthesis and degradation, amino acid catabolism, nonessential amino acid synthesis and the urea cycle. Understanding the mechanism involved in metabolic regulation has important implications in both biotechnology and medicine. It is estimated that at least a third of all serious health problems are caused by metabolic disorders. Analyzing differentiated expressed genes on the pathway may provide some insight on the early intrahepatic recurrence of hepatocellular carcinoma after curative resection. Other pathways such as CYTOKINECYTOKINE RECEPTOR INTERACTION, NEUROACTIVE LIGANDRECEPTOR INTERACTION, MAPK SIGNALING PATHWAY, and GAP JUNCTION are all hepatocellular carcinoma related. We will discuss the two pathways ANTIGEN PROCESSING AND PRESENTATION and AXON GUIDANCE in more details.
ANTIGEN PROCESSING AND PRESENTATION are processes that occur within a cell that result in fragmentation (proteolysis) of proteins, association of the fragments with MHC molecules, and expression of the peptideMHC molecules at the cell surface where they can be recognized by the T cell receptor on a T cell. However, the path leading to the association of protein fragments with MHC molecules differs for class I and class II MHC. MHC class I molecules present degradation products derived from intracellular (endogenous) proteins in the cytosol. MHC class II molecules present fragments derived from extracellular (exogenous) proteins that are located in an intracellular compartment. Both MHC class I and class II pathways play an important role in antitumor immune responses. Patients with early intrahepatic recurrence of hepatocellular carcinoma (HCC) have the downexpressed expression both on MHC I and II pathways. It is generally acknowledged that tumors usually escape from host immune surveillance by dysfunction or defect of MHC I and MHC II presentation pathways with the downexpressed genes. Therefore, the downexpressed genes on the pathway may be one of the critical reasons for early intrahepatic recurrence. The causal relations among genes can also be identified in Figure 3. The downexpressed MHCI gene may cause the downexpressed β 2m and the downexpressed MHCII and/or LI gene may cause the downexpressed SLIP and the downexpressed SLIP may cause the downexpressed CLIP. These causal relations may provide some implications on developing medicines against hepatocellular carcinoma. AXON GUIDANCE (also called axon pathfinding) is a subfield of neural development concerning the process by which neurons send out axons to reach the correct targets. Many axon guidance molecules may regulate cell migration and apoptosis in normal and tumorigenic tissues. Recent studies have shown that they are widely expressed outside the nervous system and that they may play important roles in HCC. Genes and their interactions are shown in Figure 4. For example, mutations in the ras oncogenes have been linked to many different cancers. Ras gene is overexpressed for HCC patients with early intrahepatic recurrence. The causal relations and genegene interactions are also shown in Figure 4. For instance, the downexpressed EpbB gene in patients with intrahepatic recurrence may cause the overexpressed ras gene through MAPK signaling pathway. The overexpressed DCC gene in patients with intrahepatic recurrence may cause the downexpressed Nck1 and overexpressed CALN, and so on. These genegene interactions may have prognostic implications for HCC.
Highgrade glioma data set [19]
50 highgrade glioma samples were carefully selected, 28 glioblastomas and 22 anaplastic oligodendrogliomas, all were primary tumors sampled before therapy. The classic subset of tumors were cases diagnosed similarly by all examining pathologists, and each case resembled typical depictions in standard textbooks. A total of 21 classic tumors was selected, and the remaining 29 samples were considered nonclassic tumors, lesions for which diagnosis might be controversial. Affymetrix arrays are used to determine the expression of over 12000 genes. The original goal is to separate the glioblastomas from the anaplastic oligodendrogliomas, which allows appropriate therapeutic decisions and prognostic estimation. The number of gene expression levels is 12625. Our goal is to identify genes and corresponding pathways associated with malignant gliomas.
BLpLog has identified 14 genes that are highly differentiated expressed in glioblastomas and anaplastic oligodendrogliomas (predicted AUC = 0.98). Eight pathways and associated correlated genes are identified with the integrated algorithm. The computational results are given in Table 5 and 6. The plots of FOCAL ADHESION and RIBOSOME pathways are shown in Figure 5 – Figure 6. The eight identified pathways are important in malignant gliomas and other diseases. For instance, COMPLEMENT AND COAGULATION CASCADES are composed of serine proteases that are activated through partial cleavage by an upstream enzyme. The elements of these cascades share several common structural characteristics, including a highly conserved catalytic site composed of Ser, His and Asp. The common principle underlying the organization of these systems is that proteases exist as inactive zymogens and are subsequently activated by upstream, active proteases. The initial activation might occur as a result of contact with a nonenzymatic ligand or cleavage by another protease. Understanding the interplay between complement and coagulation has fundamental clinical implications in the context of cancers with an inflammatory pathogenesis. Migration and invasion are important prerequisites for the infiltrative and destructive growth patterns of malignant gliomas. The glioma cell invasiveness depends on proteases of the coagulation and complement cascades. Another pathway, GLUTATHIONE METABOLISM, works through the operation of a group of enzymes called glutathione Stransferases (GST). Glutathione (GSH) plays a critical role in cellular mechanisms that result in cell death. The high glutathione levels may cause resistance to chemotherapy drugs. One interesting study by researchers in Texas showed that your chances of surviving a type of brain cancer, called primary malignant glioma, could depend on the type of glutathionestransferase (GST) gene you were born with. Therefore, it is possible to target glutathione metabolism in the prevention and treatment of malignant gliomas. Here we discuss the FOCAL ADHESION and RIBOSOME pathways in more details.
FOCAL ADHESIONS are large, dynamic protein complexes through which the cytoskeleton of a cell connects to the extracellular matrix, or ECM. They can be considered as subcellular macromolecules that mediate the regulatory effects (e.g. cell anchorage) of extracellular matrix (ECM) adhesion on cell behavior. Focal adhesions kinase (FAK) contributes to glioma growth and invasion. FAK integrates signals from activated growth factor receptors and integrins to regulate cell motility, invasion, proliferation, apoptosis, and angiogenesis. It, therefore, promotes tumor growth, and a role for FAK in glioma pathogenesis is suggested by its expression and localization. FAK genes are overexpressed on the pathway for glioblastoma patients as shown in Figure 5. These overexpressed genes may have many potential protumorigenic functions and produce chemotherapy resistance in the glioblastoma patients. The causal relations and genegene interactions are also shown in Figure 5. The overexpressed ECM gene is a causal gene that causes several other genes to be overexpressed. If ECM is not present, the entire pathway is shut off. Conversely, the expression changes of downstream genes such as MLC and PAK may not affect the pathway as much.
Ribosome biology is huge and accounts for up to 50% of transcriptional activity of cells. The RIBOSOME pathway is composed of genes that encode various proteins of the ribosomal subunits. These proteins need to interact physically with each other to form a large protein complex, the ribosome, and are thereby closely related functionally. Ribosome biogenesis and translation are regulated at multiple levels and are associated with accurate cell growth and proliferation. Several tumor suppressors and protooncogenes have been found either to affect the formation of the mature ribosome or to regulate the activity of proteins known as translation factors. Disruption in one or more of the steps that control protein biosynthesis has been associated with alterations in the cell cycle and regulation of cell growth. Therefore, certain tumor suppressors and protooncogenes might regulate malignant progression by altering the protein synthesis machinery. The protein genes on RIBOSOME pathway (Figure 6) are downexpressed in glioblastoma patients. These downexpressed genes may provide some insight on how to develop new drugs and cancer therapies to target the RIBOSOME pathway involved in malignant gliomas.
Conclusion
We have developed a Bayesian L_{ p }Logistic regression (BLpLog) method, defined a novel correlation measure, and proposed an integrated algorithm for gene selection and pathway identification. We have demonstrated that the simple Bayesian approach to integrating the regularization parameter out analytically performed well on prediction. The integrated algorithm can identify cancer associated genes and KEGG pathways efficiently with the test data sets. The correlation measure we defined can be used to hunt those upstream regularized genes with low expression levels, but are strongly correlated with the downstream highly differentiated genes identified with BLpLog. Almost all of the pathways found in this manuscript cannot be identified with the traditional correlation coefficient. The identified pathways can provide information on genegene interactions and causal relations for genes on the pathway. The knowledge of genegene interaction, gene regulation, and biological pathways can be applied to understanding the mechanisms of how pathway regulations have changed in different subtypes of cancer patients.
Mining high throughput data from different aspects can help us understand the cancer biology better. Our method provided much more information than we presented. For instance, we found hundreds of correlated genes for each downstream gene identified with BLpLog. Only a small proportion of the correlated genes is on the known KEGG pathways. We did not explore the functions and causal relations of the rest of the genes. Moreover, although there may be multiple pathways for each gene, we only reported the top pathway with the highest count of genes. We will infer the gene regulatory networks with the rest of the correlated genes and explore multiple pathways in the near future. Finally, gene set enrichment analysis [20] is a popular tool for evaluating microarray data at the level of gene sets. It, first, utilize a statistical test to identify highly differentiated upstream genes in two classes, and then define gene sets based on prior biological knowledge. There are two drawbacks with this method: (1) it is solely based on the partially known biological knowledge and (2) it cannot guarantee the upstream regularized genes to be selected in the set. It is, therefore, of great interests to incorporate the proposed correlation measure R into gene enrichment analysis, so that we can make sure the upstream regularized genes in the studied gene sets. This is the work of our future research.
Methods
L_{ p }Regularized Sparse Logistic Regression
A general binary classification problem may be simply described as follows. Given n samples, D = {(x_{1}, y_{1}), ..., (x_{ n }, y_{ n })}, where x_{ i }is a multidimensional input vector with dimension d and class label y_{ i }∈ {1, 1}, find a classifier f(x) such that for any input x with class label y, f(x) predicts class y correctly. The logistic regression is:
where w = (w_{1}, ..., w_{ d })^{T}are the parameters which can be estimated through maximizing the log likelihood or minimizing the negative log likelihood.
Different prior assumptions of w in the maximum aposteriori (MAP) estimation will lead to different regularization terms. The sparse parameter estimates can be achieved with L_{ p }regularization:
E = E_{ D }+ λL_{ p }, (1)
where
where λ ≥ 0 is a regularization parameter which must be tuned and {L}_{p}={\displaystyle {\sum}_{j=0}^{d}{w}_{j}{}^{p}} is the regularization term.
Bayesian Regularization
Choosing the best regularization parameter λ through crossvalidation is time consuming. Therefore we propose a Bayesian regularization framework to integrate out the parameter following the similar methods of [11, 21, 22]. Minimizing E in equation (1) has the straight forward Bayesian interpretation. The posterior distribution for w is given by
where p(D) = ∫ P(Dw)p(wλ)dw dλ is a normalization factor that ensures that the posterior integrates to 1 and is given by an integral over the parameter space. The distribution of p(Dw) is defined as:
Let's {E}_{{w}_{i}}={w}_{i}{}^{p}, and we have {L}_{p}={\displaystyle {\sum}_{i=1}^{m}{E}_{{w}_{i}}}, where m is the number of nonzero model parameters and obviously m ≤ d. The prior over model parameter, w, can be defined as
where C(\lambda )={\displaystyle \int {\displaystyle {\prod}_{i=1}^{m}\mathrm{exp}(\lambda {L}_{p})\text{d}w}} is a normalization constant for a given λ. Taking the negative logarithm of the posterior density of equation (2), we have log p(wD) =  log P(Dw)  log p(wλ) + log p(D).
Therefore,
 log p(wD) = E_{ D }+ λL_{ p }+ log p(D)  m log λ + log(C(λ)). (4)
The hyperparameter λ can be integrated out analytically in the prior distribution p(wλ).p(w) = ∫ p(wλ)p(λ)dλ.
As λ is a scaler, we can assign a new prior such that p(λ) ∝ C(λ)/λ. Substituting equation (3), we have
Using the Gamma integral {\displaystyle {\int}_{0}^{\infty}{x}^{v1}{e}^{\mu x}\text{d}x}=\frac{\Gamma (v)}{{\mu}^{v}}, we get
Since p(wD)=\frac{P(Dw)p(w)}{p(D)}, we have
 log p(wD) = E_{ D }+ m log L_{ p }+ log p(D)  log Γ(m). (5)
Let Q = E_{ D }+ m log L_{ p }, comparing equation (4) with (5), we haveE = E_{ D }+ λL_{ p }= E_{ D }+ m log L_{ p }+ R = Q + R,
where R = log p(D)  log Γ(m) is a constant not related to w. Therefore, E and Q are identical up to an additive constant and minimizing E is equivalent to minimizing the Bayesian regularization error function Q.
The BLpLog Algorithm
To find the optimal value of Q and w, we need to have the first and/or second order derivative with gradient based methods. One difficulty with L_{ p }is that it is not differentiable at zero and a differentiable approximation of L_{ p }has to be used. Differentiable approximations typically have a parameter that controls the tradeoff between the smoothness of the approximation and the closeness of the nondifferentiable function which is being approximated. One approximation which works for p ≤ 1 is
where γ is the smoothing parameter. With this differentiable approximation, we get the following modified error function:
Note that Q_{ γ }→ Q as γ → 0.
Given a small value γ, the gradient can be calculated as:
where
where vec{·} represents a vector whose ith element is given by the expression inside the brackets. The Hessian of the objective function is:
where
where diag{·} is the diagonal matrix whose ith diagonal element is given by the expression inside the brackets. LetA = diag{g(w^{T}x_{ i })(1  g(w^{T}x_{ i }))},
we have the matrix form of H:
With equation (6) and (8), we may estimate the parameters with Newton's method.
w_{ new }= w_{ old } H^{1}▽_{ w }l_{ γ }
We run the iteration until w_{ new } w_{ old } <δ, where δ > 0 is a small number. In each iteration, m is not fixed, but updated as the number of w_{ i } > β, where β is small positive number. Other algorithms such as the fixedHessian or conjugate gradient may also be employed to solve the above problem. The advantage of Newton's method is that it converges very fast when near the optimal solution. This algorithm converges from any initialization and a local maximum is guaranteed.
There is no regularization parameter tuning in this algorithm, but we have to set several approximation parameters to some reasonable values to allow the algorithm to be performed well. Theoretically, L_{ p }penalty gives asymptotically unbiased estimates of the nonzero parameters while shrinking the estimates of zero (or small) parameters to zero when p → 0 [16]. Unlike LASSO (with p = 1) that shrinks every parameter proportionally. Therefore the lower value of p would lead to more sparse and better solutions [12]. However when p is very close to zero, difficulties with convergence arise. Therefore, we set p = 0.1 in this paper. We also set the threshold β = 0.001. The smoothing parameter γ appears in the differentiable approximation to the L_{ p }norm. When γ is too large, the approximation is not a good one and the solution is overly smooth and the sparsity property of L_{ p }will be lost. When γ is very small, the number of iterations required for convergence increases drastically. We have found empirically that a choice of γ which does not require very many iterations, and yet converges to very sharp solutions is around 0.001–0.0000001 for our data We therefore set γ = 0.000001 in all experiments.
BLpLog employs a gradient decent method with superlinear convergence. To prevent the optimization from sticking to local optimal, we randomly initialize the coefficients 20 times and choose the estimated coefficients with the best AUC value for all of the computational experiments in this paper. Our experiments, however, have shown that the computational results are not sensitive to the parameter initialization and the algorithm converges quickly to the same optimization value most of the time with different parameter initializations.
Our binary classification algorithm can be extended for multiclass classification tasks. For a general cclass problem, we can employ the standard approach where two class classifiers are trained in order to separate each of the classes against all others. The decision rules are then coupled by voting, that is, sending the sample to the class with the largest probability.
Similarity and Integrated Algorithm
In gene selection and pathway discovery, we have to deal with two important problems: (1) how to take into account the genegene correlations and (2) how to hunt the upstream regulatory genes. As we discussed in the previous section, current gene selection methods can only select the downstream genes with bigger changes in expression. The characteristic of the regulatory genes is that their gene expression changes may be low but they are highly correlated with the downstream highly expressed genes. We first introduce our own correlation (similarity) measure (R) for continuous variables such as gene expression data.
where cov(x,y)={\displaystyle \sum ({x}_{i}\overline{x}){({y}_{i}\overline{y})}^{T}} is the standard covariance and var(x)={\displaystyle \sum ({x}_{i}\overline{x}){({x}_{i}\overline{x})}^{T}} is the variance. Based on this definition we have R(x, y) = R(y, x), and R = 0 when x and y are independent and R ≥ 1, when either var(x) ≤ cov(x, y) or var(y) ≤ cov(x, y). It is clear that this definition of R is different from the standard correlation coefficient r=cov(x,y)/\sqrt{var(x)var(y)} in its denominator. However, R can catch the the genes that have very small changes in expression but are highly correlated with the significantly expressed (downstream) genes. For instance, given cov(x, y) = 0.01, var(x) = 0.01, and var(y) = 1, we have both R = 1 and r = 0.1. Our definition of R guarantees that the upstream regulatory genes and the downstream genes will be in the same group.
The Integrated Algorithm
We now incorporate the correlation structure into the gene selection and pathway identification algorithm. Given a set of n independent observations {\{{x}_{i},{y}_{i}\}}_{i=1}^{n}:

1.
The gene selection step: Identifying a small subset of individual genes that are associated with cancer using L_{ p }regularized Bayesian logistic regression

2.
For each selected gene x_{ i }, find all x_{ j }from the original data set, such that R(x_{ i }, x_{ j }) ≥ h, where h is a threshold and set to 0.9 for experiments in this paper.

3.
For each subset of selected genes, identify the pathways associated with it using DAVID.
References
Bo T, Jonassen I: New feature subset selection procedures for classification of expression profiles. Genome Biology 2002, 3(4):0017. 10.1186/gb200234research0017
Chow M, Moler E, Mian I: Identifying marker genes in transcription profiling data using a mixture of feature relevance experts. Physiol Genomics 2001, 5: 99–111.
Rivals I, Personnaz L: MLPs (MonoLayer Polynomials and MultiLayer Perceptrons) for Nonlinear Modeling. Journal of Machine Learning Research 2003, 3: 1383–1398. 10.1162/153244303322753724
Zhang H, Ahn J, Lin X, Park C: Gene selection using support vector machines with nonconvex penalty. Bioinformatics 2006, 22(1):88–95. 10.1093/bioinformatics/bti736
Tipping M: Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 2001, 1: 211–244. 10.1162/15324430152748236
Li Y, Campbell C, Tipping M: Bayesian automatic relevance determination algorithms for classifying gene expression data. Bioinformatics 2002, 18: 1332–1339. 10.1093/bioinformatics/18.10.1332
Tibshirani R: Regression shrinkage and selection via the lasso. J Royal Statist Soc B 1996, 58(1):267–288.
Efron B, Johnstone I, Hastie T, Tibshirani R: Least angle regression. Annals of Statistics 2004, 32(2):407–499. 10.1214/009053604000000067
Zou H, Hastie T, Tibshirani R: On the 'Degree of Freedom' of the Lasso. Annals of Statistics 2007, 35(5):2173–2192. 10.1214/009053607000000127
Shevade S, Keerthi S: A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 2003, 19: 2246–2253. 10.1093/bioinformatics/btg308
Cawley G, Talbot N: Gene selection in cancer classification using sparse logistic regression with Bayesian regularisation. Bioinformatics 2006, 22(19):2348–2355. 10.1093/bioinformatics/btl386
Liu Z, Jiang F, Tian G, Wang S, Sato F, Meltzer S, Tan M: Sparse logistic regression with Lp penalty for biomarker identification. Statistical Applications in Genetics and Molecular Biology 2007, 6(1):Article 6. 10.2202/15446115.1248
Segal M, Dahlquist K, BR C: Regression approaches for microarray data analysis. Journal of Computational Biology 2003, 10: 961–980. 10.1089/106652703322756177
Klebanov L, Jordon C, Yakovlev A: A new type of stochastic dependence revealed in gene expression data. Statistical Applications in Genetics and Molecular Biology 2006, 5(1):Article 7.
Rapaport F, Zinovyev A, Dutreix M, Barillot E, Vert J: Classification of microarray data using gene networks. BMC Bioinformatics 2007, 8: 35. 10.1186/14712105835
Knight K, Fu W: Asymptotics for Lassotype estimators. Annals of Statistics 2000, 28: 1356–1378. 10.1214/aos/1015957397
van't Veer L, Dai H, Vijver M, He Y, Hart A, Mao M, Peterse H, Kooy K, Marton M, Witteveen A, Schreiber G, Kerkhoven R, Roberts C, Linsley P, Bernards R, Friend S: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):484–485. 10.1038/415530a
Lizuka N, Oka M, YamadaOkabe H, Nishida M, Maeda Y, Mori N, Takao T, Tamesa T, Tangoku A, Tabuchi H, Hamada K, Nakayama H, Ishitsuka H, Miyamoto T, Hirabayashi A, Uchimura S, Hamamoto Y: Oligonucleotide microarray for prediction of early intrahepatic recurrence of hepatocellular carcinoma after curative resection. The Lancet 2003, 361: 923–929. 10.1016/S01406736(03)127754
Nutt C, Mani D, Betensky R, Tamayo P, Cairncross J, Ladd C, Pohl U, Hartmann C, McLaughlin M, Batchelor T, Black P, von Deimling A, Pomeroy S, Golub T, Louis D: Gene expressionbased classification of malignant gliomas correlates better with survival than histological classification. Cancer Research 2003, 63(7):1602–1607.
Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, Gillette M, Paulovich A, Pomeroy S, Golub T, Lander E, Mesirov J: Gene set enrichment analysis: A knowledgebased approach for interpreting genomewide expression profiles. PNAS 2005, 101: 9309–9314.
MacKay D: Comparison of approximate methods for handling hyperparameters. Neural Computation 1999, 11(5):1035–1068. 10.1162/089976699300016331
Williams P: Bayesian regularization and pruning using a Laplace prior. Neural Computation 1995, 7(1):117–143. 10.1162/neco.1995.7.1.117
Acknowledgements
We thank the Associate Editor and the two anonymous referees for their constructive comments which helped improve the manuscript. ZL was partially supported by grant 1R03CA12810202 from the National Institute of Health.
Author information
Authors and Affiliations
Corresponding author
Additional information
Authors' contributions
ZL conceptualized and designed method, developed the software, and wrote the manuscript. RG and FJ analyzed and interpreted the data on its biological contents. MT helped in method design and manuscript writing and revised the manuscript critically. XJ did the actual computations. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Liu, Z., Gartenhaus, R.B., Tan, M. et al. Gene and pathway identification with L_{ p }penalized Bayesian logistic regression. BMC Bioinformatics 9, 412 (2008). https://doi.org/10.1186/147121059412
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/147121059412