- Methodology article
- Open Access
Regularized gene selection in cancer microarray meta-analysis
BMC Bioinformatics volume 10, Article number: 1 (2009)
In cancer studies, it is common that multiple microarray experiments are conducted to measure the same clinical outcome and expressions of the same set of genes. An important goal of such experiments is to identify a subset of genes that can potentially serve as predictive markers for cancer development and progression. Analyses of individual experiments may lead to unreliable gene selection results because of the small sample sizes. Meta analysis can be used to pool multiple experiments, increase statistical power, and achieve more reliable gene selection. The meta analysis of cancer microarray data is challenging because of the high dimensionality of gene expressions and the differences in experimental settings amongst different experiments.
We propose a Meta Threshold Gradient Descent Regularization (MTGDR) approach for gene selection in the meta analysis of cancer microarray data. The MTGDR has many advantages over existing approaches. It allows different experiments to have different experimental settings. It can account for the joint effects of multiple genes on cancer, and it can select the same set of cancer-associated genes across multiple experiments. Simulation studies and analyses of multiple pancreatic and liver cancer experiments demonstrate the superior performance of the MTGDR.
The MTGDR provides an effective way of analyzing multiple cancer microarray studies and selecting reliable cancer-associated genes.
Microarrays are capable of profiling human tissues on a genome-wide scale and have been used extensively in cancer studies, where expressions of thousands of genes are measured along with clinical outcomes. A major goal of such studies is to identify a subset of cancer-associated genes that can be used as biomarkers for cancer diagnosis and prognosis and as targets for therapy. Early studies have shown that gene signatures identified from the analysis of individual cancer microarray experiments often have low reproducibility. There are several reasons for this. A main one is that the sample size of a single microarray experiment, which is usually in the hundreds, is much smaller than the number of genes, which is usually in the tens of thousands.
Within the field of clinical investigation, meta analysis has emerged as the gold standard for the comparison and combined analysis of clinical studies. It is generally accepted that only meta analysis can circumvent the problems inherent to studies with low statistical powers due to low sample sizes . With meta analysis, it is usually not the intention of researchers to analyze any new datasets. Rather, it provides an effective way of pooling and analyzing multiple existing datasets and generating results more reliable than those from the analysis of each individual data set.
Meta analysis of cancer microarray data is made possible by the many experiments conducted independently to measure the same set of genes and the same cancer clinical outcomes. As shown in [2–5], the meta analysis of cancer microarray data has achieved considerable successes by identifying relatively reproducible, biologically meaningful gene signatures. We refer to  for more discussions of the merits of meta analysis in genomic studies.
Meta analysis of cancer microarray data is challenging because (1) microarray experiments usually measure a small number of samples and a large number of genes, with only a subset of those genes associated with cancer clinical outcomes. Gene selection is needed along with estimation; (2) the meta analysis of cancer microarray data and the identification of cancer-associated genes often require the use of original expression measurements. For this reason, the type of analysis conducted in this article has also been referred to as "integrative analysis". Such analysis differs significantly from conventional meta analysis, where the analysis is based on summary statistics (such as p-values) from each individual experiment; and (3) different platforms may be used in different experiments. Arrays that hybridize one sample at a time (e.g., synthesized oligonucleotide arrays) measure gene expression based directly on the signal intensity of each probe set. In contrast, spotted cDNA arrays hybridized with fluorescent-labeled targets typically measure the ratio of the signal from a test sample to the signal of a co-hybridized reference sample. It has been shown that data from Affymetrix GeneChip oligonucleotide microarrays correlate poorly with the data from custom-printed cDNA microarrays . We note here that comparability of different platforms can be achieved by the transformation of the expressions. However, as noted in previous studies (such as ), such transformation needs to be conducted on a case-by-case basis.
Several approaches have been proposed to analyze the marginal effects of genes using data from multiple microarray experiments. Examples of this include Fisher's approach (with application to breast cancer ); an intensity approach that transforms and directly integrates gene expressions ; a penalization approach ; a random effect model based approach ; a robust gene ranking approach ; and a Bayesian approach .
In light of the fact that cancer development and progression are caused by the effects of multiple genes, the following studies (which can account for the joint effects of genes) have been conducted. A majority voting (with impact factors) approach has been proposed by . Gene shaving approaches based on random forrest and Fisher's linear discrimination are applied in . And a computationally intensive Bayesian approach is proposed in . We note that the focus of those studies has been predictive model building, not gene selection.
On the other hand, there is rich literature for the analysis of a single cancer microarray data and gene selection. Examples include the parameterized classifier design approach in ; the penalization approaches in [17, 18]; the Threshold Gradient Directed Regularization (TGDR) approach [19–21]; and the support vector machine approach . We refer to  for more discussions of gene selection approaches with individual microarray datasets. We note, however, that those approaches have been designed to analyze a single dataset, and cannot be used to analyze multiple, heterogeneous datasets.
The literature review suggests that (1) genes identified from analysis of a single cancer microarray data may suffer from low reproducibility because of the small sample size. Meta analysis pools multiple datasets, increases statistical power, and provides an effective way of improving reproducibility; (2) existing meta analysis approaches focus on either the investigation of the marginal effects of genes or the construction of predictive models with multiple genes; and (3) approaches exist that can select genes with joint effects on cancer in the analysis of a single dataset. However, these approaches cannot be used to analyze multiple, heterogeneous data. Thus, there is a critical need for approaches that can select genes with joint effects on cancer in the meta analysis of multiple microarray data.
In this article, we propose the Meta Threshold Gradient Descent Regularization (MTGDR) approach for gene selection in cancer microarray meta analysis. The MTGDR takes advantage of recent developments in regularized gene selection with a single microarray dataset. Compared to such single-dataset gene selection methods, the MTGDR has the desired flexibility of accommodating multiple experiments with different setups. And in comparison with the available meta analysis methods, the MTGDR can effectively select a subset of genes with joint effects on cancer.
Results and discussion
We conduct simulation studies to investigate the performance of the proposed MTGDR. We generate M = 3 datasets. For dataset m = 1, 2 and 3, we generate n m samples and expressions of d genes. Gene expressions are generated in a way that all expressions have marginally normal distributions with unit variance, and the correlation between the expressions of genes i and j is 0.4|i-j|. In each dataset, the first 20 genes are associated with the cancer outcome. Specifically, for genes i = 1, ..., 20, the mean expressions of the n m /2 cases (outcome Y m = 1) are generated randomly from Uniform[l, u]. The mean expressions for the genes of the controls (outcome Y m = 0) are zero. The mean expressions for the genes not associated with the outcomes are zero. The simulation setting here corresponds to the logistic regression models for all three datasets. The regression coefficients for the cancer-associated genes vary across studies, which corresponds to different experimental setups (for example different platforms) in different studies.
We consider combinations of the following simulation settings: (1) sample size n m = 30 and 100; (2) number of genes d = 100, 500 and 1000; and (3) different levels of "signals" [l, u] = [0.5, 1.0] and [1.0, 1.5]. Thus, there are a total of 12 different simulation scenarios.
We employ the proposed MTGDR, and tuning parameters are selected via the 3-fold cross validation. For comparison, we also consider the following two alternative approaches: (1) the pooled TGDR approach. Other than the differences in regression coefficients (shifts of mean expressions), the three datasets are generated in a comparable manner. We pool all three datasets together, treat them as if they were from a single experiment, and analyze them with the TGDR approach; and (2) the meta analysis approach based on individual TGDR analysis. We first analyze each dataset using the TGDR approach. We then search for genes identified in all three studies. This corresponds to the meta analysis approach where each dataset is analyzed separately using the TGDR and the results are combined via a voting approach. We note that other alternative approaches exist. For example, it is possible to replace the TGDR approach with the penalization approaches discussed in . Early studies have established the comparable performance of the TGDR with alternative approaches [19–21]. Since the proposed MTGDR shares a similar thresholding paradigm with the TGDR, we focus on the aforementioned two alternatives.
In Table 1, we show the mean (standard deviation) of the number of identified genes and the number of true positives based on 200 replicates. We can see that (1) the proposed MTGDR is capable of identifying the majority of the genes truly associated with the outcome and has very small false positive rates; (2) the performance of the pooled analysis is less satisfactory but still acceptable. We note that the three simulated datasets are more comparable than those encountered in practical studies. The regression coefficients differ across datasets, though the differences are small. This comparability explains the reasonable performance of the pooled analysis and should not be expected in general with practical data; and (3) the "individual TGDR + voting" meta analysis approach has inferior performance, which is caused mainly by the small sample size and the subsequent lack of reproducibility of each individual dataset. We have also conducted simulations under other settings and drawn similar conclusions (results not shown).
Pancreatic cancer study
Pancreatic ductal adenocarcinoma (PDAC) is a major cause of malignancy-related deaths. Apart from surgery, there is still no effective therapy, and even resected patients die usually within one year postoperatively. Several experiments have been conducted using microarrays to identify pancreatic cancer genomic markers. In our study, we gather and analyze four studies, which are first reported in [24–27]. These four datasets have also been analyzed by , and it has been argued that the clinical settings in the four studies are comparable. Thus, it is reasonable to conduct meta analysis with such data. We show the data descriptions in Table 2. Two of the four studies use cDNA arrays, and two use oligonucleotide arrays. Cluster ID and gene names are assigned to all of the cDNA clones and Affymetrix probes based on UniGene Build 161. The two sample groups considered in our analysis are PDAC and normal pancreatic tissues. Data on chronic pancreatitis are available for [25, 27], but will not be used in our analysis.
For each dataset, data processing (including normalization) has been separately conducted by researchers in each individual study. We identity a consensus set of 2984 UniGene IDs. We remove genes with more than 30% missingness in any of the four datasets. There are 1204 genes remained for downstream analysis. For each data separately, if Affymetrix is used, we first add a floor of 10 and make log2 transformations of the expressions. We then fill in missing values with medians across samples and standardize each gene expression to have zero mean and unit variance.
In the MTGDR analysis, tuning parameters are chosen via the 3-fold cross validation. Fifteen genes are identified as being associated with the risk of developing pancreatic cancer. We show the gene IDs and corresponding estimates in Table 3. We can see that if a gene has a nonzero coefficient in one dataset, then it has nonzero coefficients in all datasets (which indicates that this gene is identified in all studies). We also note that the estimated coefficients for one gene can be different across studies. This is the extra flexibility allowed by the MTGDR over the pooled analysis, which naturally accommodates differences among experimental setups in different studies. Furthermore although the estimated coefficients may be different for one gene across experiments, their signs are the same. The same signs lead to similar biological conclusions (i.e., whether up-regulations of the genes are positively or negatively associated with the risk of developing cancer).
We evaluate the biological implications of selected genes by surveying  and other public databases. Among the 15 genes, several have been previously identified in independent studies. Specifically, gene Hs.107 (Fibrinogen-like 1) is a member of the fibrinogen family. In large scale proteomic analysis of serum samples, certain members from the fibrinogen family have been found to be over-expressed in pancreatic cancer samples . Gene Hs.12068 (Carnitine acetyltransferase) is a key enzyme in the metabolic pathway in mitochondria, peroxisomes, and endoplasmic reticulum. CRAT catalyzes the reversible transfer of acyl groups from an acyl-CoA thioester to carnitine and regulates the ratio of acylCoA/CoA in the subcellular compartments. In addition, CRAT has been found to be significantly under-expressed in PDAC samples . Gene Hs.169900 (PABPC4) is localized primarily in the cytoplasm. It may be necessary for the regulation of stability of labile mRNA species in activated T cells. It is one of the pancreatic cancer biomarkers identified in , where it is down-regulated at least four-fold in four or more PDAC specimens.
Gene Hs.180920 (RPS9 ribosomal protein S9) encodes a ribosomal protein that is a component of the 40S subunit. The protein belongs to the S4P family of ribosomal proteins. Crnogorac-Jurcevic et al.  was the first to identify the association between the dysregulated expression of PRS9 and PDAC. Gene Hs.287820 (fibronectin 1) encodes fibronectin, a glycoprotein present in a soluble dimeric form in plasma and in a dimeric or multimeric form at the cell surface and in the extracellular matrix. Fibronectin plays an important role in maintaining the structural integrity of the pulmonary epithelium and endothelium. Decreases in serum fibronectin and increases in pulmonary leukocyte margination during acute pancreatitis may compromise the integrity of the air-blood barrier and also increase the pulmonary uptake of circulating pathogenic materials. Gene Hs.317432 (BCAT1) encodes the cytosolic form of the enzyme branched-chain amino acid transaminase. This enzyme catalyzes the reversible transamination of branched-chain, alpha-keto acids to branched-chain, L-amino acids essential for cell growth. It is one of the pancreatic cancer markers broadly identified . Gene Hs.5591 (MKNK1) belongs to the MAPK pathway, which has been identified to be associated with the development of multiple cancers. Protein encoded by gene Hs.62 (PTPN12) is a member of the protein tyrosine phosphatase (PTP) family. PTPs are known to be signaling molecules that regulate a variety of cellular processes including cell growth, differentiation, mitotic cycle, and oncogenic transformation. As has been pointed out by , gene Hs.75335 (GATM) has been identified as a pancreatic cancer marker in multiple independent studies. Gene Hs.78225 (NBL1) is located at chromosome 1p36. Deletion of material from this region is common in ineuroblastoma. It is possible that a tumor suppressor gene is present in this region.
Ideally, statistical evaluations of the MTGDR should be based on independent data, though it is often unavailable. As an alternative, we conduct evaluations using the following Leave-One-Out (LOO) approach, which has been adopted extensively in cancer microarray studies. We first remove one subject from the dataset. With the reduced dataset, we compute the MTGDR estimate. We note that, to get a relatively fair evaluation, a new set of tuning parameters needs to be computed for the reduced dataset. With the MTGDR, we are able to obtain one regression model for each individual dataset. Then using the model for the dataset that the removed subject belongs to, we are able to predict the probability and class membership (by dichotomizing the predicted probability at 0.5) for the removed subject. We repeat this procedure over all subjects and compute the classification error. With the LOO approach, the MTGDR misclassifies 2 subjects in data P3; otherwise, it achieves perfect classification.
Analyses With Alternative Approaches
To facilitate a more comprehensive understanding of the MTGDR approach and the pancreatic study, we conduct the following additional analyses.
ANALYSIS WITH THE POOLED TGDR APPROACH
As in the simulation study, we ignore the fact that the four datasets are from different studies that use different platforms. We pool the four datasets and analyze them using the TGDR approach. The sample size of the pooled dataset is 56. A total of 22 genes are identified using this approach. Specifically, this approach identifies 13 of the 15 genes identified by the MTGDR and misses genes BCAT1 and NBL1. As discussed in the above section, both of those two genes have important implications in pancreatic cancer development. (More detailed information on gene identification using this approach is available upon request.) We also evaluate performance of the pooled approach using the LOO. Two subjects (1 in P3 and 1 in P4) are not properly classified.
META ANALYSIS BASED ON INDIVIDUAL TGDR
We first analyze each dataset using the TGDR approach and then search for genes identified in multiple studies. This is a voting-based meta analysis approach. For the four datasets, TGDR identifies 7 (P1), 10 (P2), 6 (P3), and 1 (P4) genes, respectively. The numbers of overlaps with genes identified using the MTGDR are 1, 1, 2, and 0, respectively. There is only 1 gene identified with both P2 and P3. Otherwise, there is no overlap between genes identified with the four datasets. Genes identified in one study cannot be used to satisfactorily predict subjects in other studies. For example, we use genes identified in P2 and the corresponding logistic model to make predictions for the rest of the three datasets. Four (P1), 6 (P3), and 4 (P4) subjects cannot be properly classified.
META ANALYSIS OF MARGINAL EFFECTS
With the MTGDR and the two alternative approaches, we search for genes with joint effects on pancreatic cancer development. To provide a more comprehensive analysis of the pancreatic data, we conduct the following analysis of marginal effects. Since the pancreatic data have the "normal versus cancer" binary setup, for each dataset and each gene, we conduct the two-sample comparison of expressions of normal versus cancer samples using the t-test and compute the p-value. For each gene, we combine the p-values across four studies using the Fisher's approach . We then rank genes using the p-values from the meta analysis. Genes with smaller combined p-values have smaller ranks. We note that this is the conventional meta analysis approach for data with binary outcomes. With this approach, we investigate the marginal associations between each individual genes and the cancer outcome. We show the ranks for the MTGDR-identified genes in Table 3. We can see that several MTGDR-identified genes have very low ranks. Specifically, genes with marginal ranks 1–7 are identified using the MTGDR. However, there are also MTGDR-identified genes with very high ranks. For example, genes Hs.317432, Hs.5591, and Hs.62 have ranks 144, 56, and 50, respectively. Our analysis suggests that meta analysis and identification of genes with joint effects cannot be replaced with meta analysis of marginal effects.
Liver cancer study
Gene expression profiling studies have been conducted on hepatocellular carcinoma (HCC), which is among the leading causes of cancer deaths in the world. We conduct meta analysis using the four liver cancer microarray datasets described in . Detailed data information is provided in Table 4, where the four datasets are referred to as D1–D4, respectively. The four datasets were generated in three different hospitals in South Korea. Although the studies were conducted in a controlled setting, Choi et al.  "failed to directly merge the data even after normalization of each dataset."
In studies D1–D3, expressions of 10336 genes were measured. In study D4, expressions of 9984 genes were measured. We focus on the 9984 genes measured in all four studies. For each dataset, the within-print-tip-group normalization is first carried out. We then process the data as follows:
(1.1) if a gene has more than 30% of missingness in any dataset, it is removed from downstream analysis. In total, 3122 out of 9984 genes pass this screening.
(1.2) if a subject has more than 30% missing expressions for the 3122 genes, then this subject is removed. Eight subjects are removed, leading to an effective sample size of 125. We show the number of subjects actually used in the analysis in Table 4.
For each dataset, we fill in missing expression values with medians across samples.
Supervised screening: for each dataset, we compute the two-sample t-statistic for each gene. We then assign a rank to each gene based on the t-statistic. The overall rank for one gene is defined as the sum of ranks across all four datasets. One thousand genes with the lowest ranks are selected for downstream analysis. This rank-based screening shares similar spirits as the one in .
For each dataset, we normalize each gene expression to have zero mean and unit variance.
Gene screening is conducted to exclude genes which are very unlikely to be cancer-associated. Similar procedures have been adopted in  and others.
We employ the MTGDR approach with optimal tuning parameters selected using the 3-fold cross validation. Thirty-four genes are identified as being associated with the risk of developing liver cancer. We provide information and corresponding estimates for identified genes in Table 5. We draw similar conclusions from Table 5 as from Table 3. We note that, for a very small number of genes, the signs of the four estimates are different. For example, for gene 15.4.E1/Rab9 effector p40, three out of four estimated coefficients are positive, and one is negative. The negative coefficient has a small absolute value and can be caused by random variations. Different signs may suggest conflicting biological conclusions. Without having access to the original experimental setup or a gold standard, we are unable to make further explanations of the conflicting signs. Although those genes have been identified with the MTGDR, they should be interpreted with extreme caution because of those conflicting signs.
We search public databases for independent evidence of associations between identified genes and liver cancer development. Among the identified genes, gene KIAA0406 is one that constitutes the predictor of PI3 kinase activation. The PI3 kinase signaling pathway is emerging as a promising therapeutic target in a number of cancers as well as inflammation and heart diseases. It has been found in a rat experiment that the mRNA and protein levels of Cyt19 are higher in the liver than in other tissues. Gene Rab9 belongs to the RAS oncogene family, which is activated in multiple cancers. ATPases are a class of enzymes that catalyze the decomposition of adenosine triphosphate (ATP) into adenosine diphosphate (ADP) and a free phosphate ion. This dephosphorylation reaction releases energy, which the enzyme (in most cases) harnesses to drive other chemical reactions that would not otherwise occur. RalGDS is an oncogene and can induce transformation and gene expression by activating Ras, Ral, and Rho mediated pathways. The combination of TPI and an antitumor nucleoside, FTD, not only enhances the antitumor efficacy and decreases the toxicity of FTD, but it also suppresses TP-induced angiogenesis. Protein encoded by ADFP is a major constituent of the globule surface. Increases in mRNA levels are one of the earliest indications of adipocyte differentiation. The Human G protein-coupled receptor has been found expressed in lung, heart, and lymphoid tumor tissues. MEN-1 is a cancer predisposition gene and has been found to be activated in pancreatic, ovarian, and male breast cancers. Polyspecific organic cation transporters in the liver, kidney, intestine, and other organs are critical for the elimination of many endogenous small organic cations as well as a wide array of drugs and environmental toxins. Gene SLC22A1 is one of three similar cation transporter genes located in a cluster on chromosome 6. Mutations of gene TUBB have been found in breast and non-small cell lung cancers. Gene H2AFZ encodes a replication-independent member of the histone H2A family that is distinct from other members of the family. Studies in mice have shown that this particular histone is required for embryonic development and revealed that the lack of functional histone H2A can lead to embryonic lethality. This gene encodes a member of the Asp-Glu-Ala-Asp (DEAD) box protein family. Members of this family are believed to be involved in embryogenesis, spermatogenesis, and cellular growth and division.
We conduct statistical evaluations using the LOO approach described above. The MTGDR misclassifies 6 (D1), 8 (D2), 4 (D3), and 2 (D4) subjects, respectively, which leads to an overall classification error of 0.16. We note that, supervised screening has been conducted prior to analysis. To make a fair evaluation, in the LOO procedure, we carry out the supervised screening for each reduced data (with one subject removed) separately. The possibility of overly optimistic evaluation can be minimized.
Analyses with alternative approaches
As for the pancreatic study, we conduct the following analyses using alternative approaches.
ANALYSIS WITH THE POOLED TGDR APPROACH
We pool the four datasets, which have a combined sample size of 125, and analyze with the TGDR approach. This pooled approach identifies 24 out of the 34 genes identified by the MTGDR, misses 10, and identifies 10 extra genes not identified by the MTGDR. (Detailed information on gene identification using this approach is available upon request.) We also evaluate the performance of this pooled approach using the LOO. Six (D1), 13 (D2), 11 (D3), and 6 (D4) subjects are not properly classified, which leads to an overall classification error of 0.29.
META ANALYSIS BASED ON INDIVIDUAL TGDR
We analyze each individual dataset using the TGDR approach and then search for overlaps of identified genes. For the four datasets, the TGDR identifies 27 (D1), 10 (D2), 20 (D3), and 6 (D4) genes. The numbers of overlaps with genes identified using the MTGDR are 4, 4, 3, and 1. Among the identified genes, one is identified in three datasets, another one is identified in two datasets, and the remainder are identified in only one. Genes identified using one dataset cannot be used to make satisfactory predictions for other datasets. For example, when genes identified with D1 and the corresponding logistic regression model are used to predict subjects in the rest of the three datasets, 20 (D2), 8 (D3), and 6 (D4) subjects cannot be properly classified.
META ANALYSIS OF MARGINAL EFFECTS
We conduct meta analysis of the marginal effects as described in the pancreatic cancer study. In Table 5, we show the marginal ranks of the MTGDR-identified genes. A few MTGDR-identified genes also have very strong marginal effects. Specifically, genes with marginal ranks 1 and 3 are identified with the MTGDR. On the other hand, there are several MTGDR-identified genes with very high marginal ranks.
For many types of cancers, multiple microarray experiments have been independently conducted to search for genes associated with the same clinical outcomes. Early studies have suggested that genes identified from the analysis of a single cancer microarray dataset may have low reproducibility. Among the several possible causes are the small sample sizes and lack of statistical power. A cost effective solution is to pool multiple existing datasets with similar study designs and conduct meta analysis. The merits of meta analysis with cancer microarray data have been established in many early studies and summarized in . In this article, we have developed a new gene selection method in the meta analysis of multiple cancer microarray data.
In terms of methodology, the MTGDR differs significantly from existing approaches. Compared to most existing meta analysis approaches, the MTGDR focuses on the selection of genes with joint effects on cancer and embeds gene selection in estimation. Thus, it can complement existing meta analysis of marginal effects and help to provide a more comprehensive description of the effects of genes. When compared to pooled analysis, the MTGDR allows for experiment-specific regression coefficients. Such a strategy shares similar spirits as the random effects approaches in conventional meta analysis. However, existing random effects approaches are designed for data with a small number of covariates and do not have built-in gene selection mechanisms. The MTGDR advances from such approaches by incorporating gene selection in modeling. It can automatically accommodate different experimental setups, especially different platforms. Compared to intensity approaches that seek for transformations of gene expressions, the MTGDR does not need be conducted on a case-by-case basis. In comparison to classic meta analysis approaches, the MTGDR pools and analyzes raw data instead of summary statistics and can be more informative. In addition, the MTGDR puts more emphasis on gene selection.
Our simulation studies suggest that the MTGDR outperforms the meta analysis approach based on an individual dataset gene selection method. More specifically, it is capable of identifying the same number or more of the true positives with a lower false positive rate. In addition, performance of the MTGDR is relatively insensitive to the increase of the number of genes. Analyses of pancreatic and liver cancer studies suggest that (a) the MTGDR is capable of identifying a small number of genes that show relatively consistent effects on cancer outcomes across multiple studies; (b) many of the identified genes have been confirmed in independent studies. The LOO evaluation generates small classification errors; (c) the gene sets identified by the MTGDR can be considerably different from those identified by alternative approaches. Alternative approaches have inferior performance in terms of inconsistency of identified genes across multiple studies and larger classification errors; and (d) genes identified using the MTGDR may differ significantly from genes with low ranks in the meta analysis of marginal effects.
Despite its significant advancements over existing approaches, our study may have the following limitations. First, in the analysis of the liver data, inconsistent signs for a small number of genes are observed. Such inconsistency is not observed in the pancreatic data analysis or the simulation. It is possible to modify the MTGDR algorithm and force the signs to be the same across multiple studies. For example, for a specific gene, suppose that one gradient is small and negative, and the other three gradients are large and positive. We can add an additional thresholding and set the negative gradient to be zero. We choose to allow inconsistent signs, which may help raise an alarm on the comparability of data and the applicability of the proposed approach when such inconsistency is observed. Second, in our data analysis, we are able to provide partial interpretations of the identified genes. Many of these have been confirmed in independent studies. However, for the liver cancer data, detailed information on several identified genes is not available. Since the focus of this study is to develop a new meta analysis approach, we do not further pursue the biological implications of the analysis results. Third, in the analysis, we evaluate the performance of the MTGDR using the LOO approach. With properly utilized cross validation, the evaluation and comparison with other approaches are expected to be reasonably fair. In standard logistic regression analysis, when the sample size is much larger than the number of genes, there are several other ways of evaluating the fitted model and selected covariates. For example, p-values and R2 can be computed. However, we note that the validity of those evaluation criterions is established under the "sample size >> number of covariates" setting and is not applicable to the microarray data, where the number of genes is much larger than the sample size. To our best knowledge, there is still no consensus on evaluation methods with cancer microarray meta analysis.
Data and model
For simplicity of notation, we assume that the same set of d genes are measured in all M different experiments with M > 1. When different sets of genes are measured in different experiments, the MTGDR is still applicable by setting the expressions of missing genes as zero. Note that meta analysis can be less powerful when the number of genes measured in all studies decreases. For 1 ≤ m ≤ M, let Ymdenote the clinical outcomes and Zmdenote the gene expressions in the m th experiment. For each experiment, we assume a regression model Y m ~ ϕ(Zm' βm), where βmis the regression coefficient, Zm' denotes the transpose of Zm, and ϕ is the known link function. By considering the joint modeling of multiple genes, we are able to account for the joint effects of genes on the clinical outcomes.
We assume the same link function ϕ across different experiments. This assumption has generally been made in meta analysis. However, we allow for different regression cofficients βmand, hence, different models under different experiments. Such a strategy has been motivated by the fixed effect models in meta analysis . The rationale is that a one unit gene expression change in experiment 1 (say, for example, a cDNA study) may not be equivalent to a one unit change in experiment 2 (say, for example, an Affymetrix study). The regression coefficients, which measure the strength of associations, should be allowed to differ.
We choose data with binary outcomes to describe the proposed MTGDR. We note that this method is also applicable to other types of cancer clinical outcomes, as long as statistical models and objective functions can be properly defined. For experiment m and binary outcome, Ym= 1 and Ym= 0 may denote the presence and absence of cancer or two different cancer stages, respectively. We assume the commonly used logistic regression model, which postulates that the logit of the conditional probability logit(P(Ym= 1|Zm)) = αm+ Zm' βm, where αmis the unknown intercept.
Suppose that there are n m iid observations in experiment m. The log-likelihood is:
Since the intercept αmis usually of little interest, for simplicity, we rewrite Rm(αm, βm) as Rm(βm).
The MTGDR is a gene selection method. It embeds gene selection in the construction of regression models. Gene selection then amounts to identifying the nonzero components of the regression coefficients βm.
Under the present setup, it is natural to make the following assumptions: (S1) The sets of genes with nonzero coefficients (i.e., the identified cancer-associated genes) are the same across different experiments. Under meta analysis, we expect certain comparability of multiple studies. Thus, although data generated under different experiments are not directly comparable, the biological conclusions should be comparable. In other words, we should conclude that the same sets of genes are associated with cancer across different experiments; (S2) Although similar logistic regression models are used to link genes with cancer outcomes in all experiments, the nonzero components of the regression coefficients βmmay be not equal across experiments. This assumption is due mainly to the concern of different experimental setups, especially platforms.
Let β = (β1, ..., βM) and R(β) = R1(β1) + ... + RM(βM). Here β is a d × M matrix. Let Δν be a small positive increment, as in ordinary gradient descent searching. In the implementation of this algorithm, we choose Δν = 10-3. Let βm(ν) denote the parameter estimate of βmcorresponding to ν. Let 0 ≤ τ ≤ 1 be a fixed threshold value. The MTGDR algorithm proceeds as follows.
Initialize β = 0 (component-wise) and ν = 0.
With current estimate β, compute the d × M negative gradient matrix g(ν) = -∂R(β)/∂β, where the (j, m) element of g is .
Compute the length d vector of meta gradient G, where the jthcomponent of G is .
Compute the meta threshold vector F(ν) of length d, where the jthcomponent of F(ν): F j (ν) = I(|G j (ν)| ≥ τ × max l |G l (ν)|) and I is the indicator function.
Update the (j, m) element of β: βj, m(ν + Δν) = βj, m(ν) - Δνgj, m(ν)F(ν) and update ν by ν + Δν.
Steps 2–5 are iterated k times, where k is determined by cross validation.
In Step 1, the MTGDR algorithm starts with the zero estimates (i.e., no gene is identified as cancer-associated). In Step 2, the gradients are computed for each individual dataset. Genes with stronger effects on cancer outcomes will have larger gradients. In Step 3, the meta gradient, which is defined as the sum across different experiments, is computed. It evaluates the overall effects of genes on cancer outcomes across multiple experiments. For example, consider that gene 1 shows only a large positive effect in experiment 1 and no effects in other experiments, whereas gene 2 shows moderate negative effects in all experiments. Then the sum of gradients for gene 2, which measures the overall effect across multiple experiments, may be larger than that for gene 1. Gene 2 is thus more likely to be selected since consistent effects are demonstrated across experiments. In Step 4, a meta threshold vector is computed. With this vector, when a gene is selected, it is selected in all models across multiple experiments. In Step 5, we update the MTGDR estimates for only those selected genes. In addition, by allowing for different gradients across multiple studies, the MTGDR allows for different estimates (and, hence, different models) for different experiments.
The tuning parameters τ and k jointly determine the property of β and the property of gene selection. When τ ≈ 0, β is dense even for small values of k (i.e, many genes are selected). When τ ≈ 1, β is sparse for small k and remains so for a relatively large number of iterations. But it will become dense eventually. At the extreme, when τ = 1, the MTGDR usually updates estimates for a single gene at each iteration, which is similar to the stage-wise approaches. When τ is in the middle range, the characteristics of β are between those for τ = 0 and τ = 1. For τ ≠ 0, gene selection can be achieved with cross-validated, finite k by having certain components of β exactly equal to zero.
As can be seen, the MTGDR involves only simple calculations and can be programmed with many existing software. In our study, research software has been developed using R and is available at .
The MTGDR has been partly motivated by the TGDR . The two approaches share a similar thresholding scheme. However, the MTGDR differs significantly from the TGDR by analyzing multiple datasets. When analyzing a single dataset with the TGDR, the effect of a gene can be represented by a single number – its regression coefficient. However, when multiple datasets are present, the effect of a gene needs to be considered across multiple studies and represented with a vector of regression coefficients. Loosely speaking, the TGDR conducts the selection of individual coefficients, whereas the MTGDR conducts the selection of groups of coefficients. Although intuitively simple, extension from individual selection to group selection has been shown to be highly nontrivial.
Tuning parameter selection
We use the V-fold cross validation to select the optimal k and τ. For τ = 0,0.05, ..., 0.95,1, we search over k to maximize the V-fold cross validation objective function, which can be defined following . With the V-fold cross validation, partial protection against over-fitting is also provided. In this study, we set V = 3, which is due mainly to the small sample size consideration.
A graphic demonstration
We use the following numerical example to demonstrate the MTGDR parameter paths. For m = 1, 2 and 3, we generate data from . In this simulated meta analysis, there are three independent experiments and four genes per experiment. s are generated independently and N(0,1) distributed. We set β1 = (2.0, 2.0, 0,0), β2 = (1.5, 1.5, 0,0) and β3 = (1.0, 1.0, 0,0). In all three experiments, only the first two genes are associated with the binary outcomes, and their corresponding coefficients are different. We simulate 50 observations in each experiment.
The 3-fold cross validation select τ = 1.0 and k = 620. We show in Figure 1 the parameter paths as a function of k for τ = 1.0. Individual parameter paths are similar to the stage-wise paths. We can see that for any k, the estimated coefficients for one gene are either all zero or all nonzero across experiments. For a specific gene with nonzero coefficients, the estimated coefficients are different across experiments.
Maynard A, Chalmers I: Non-random Reflections on Health Services Research. London: BMJ Publishing Group; 1997.
Choi J, Choi J, Kim D, Choi D, Kim B, Lee K, Yeom Y, Yoo H, Yoo O, Kim S: Integrative analysis of multiple gene expression profiles applied to liver cancer study. FEBS Letters 2004, 565: 93–100.
Ghosh D, Barette TR, Rhodes D, Chinnaiyan AM: Statistical issues and methods for meta-analysis of microarray data: a case study in prostate cancer. Funct Integr Genomics 2003, 3(4):180–188.
Wang J, Coombes KR, Highsmith WE, Keating MJ, Abruzzo LV: Differences in gene expression between B-cell chronic lymphocytic leukemia and normal B cells: a meta-analysis of three microarray studies. Bioinformatics 2004, 20(17):3166–3178.
Warnet P, Eils R, Brors B: Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics 2005, 6: 265.
Guerra R, Allison DB, Goldstein D: Meta-analysis and Combining Information in Genetics and Genomics. Chapman & Hall/CRC; 2008.
Kuo WP, Jenssen TK, Butte AJ, Ohno-Machado L, Kohane IS: Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 2002, 18: 405–412.
Shabalin AA, Tjemeland H, Fan C, Perou CM, Nobel AB: Merging two gene-expression studies via cross-platform normalization. Bioinformatics 2008, 24: 1154–1160.
Smith DD, Saetrom P, Snove O, Lundberg C, Rivas GE, Glackin C, Larson GP: Meta-analysis of breast cancer microarray studies in conjunction with conserved cis-elements suggest patterns for coordinate regulation. BMC Bioinformatics 2008, 9: 63.
Stevens JR, George RW: Meta-analysis combines Affymetrix microarray results across laboratories. Comp Funct Genomics 2005, 6(3):116–122.
Hong F, Breitling R, McEntee CW, Witter BS, Nemhauser JL, Chory J: RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics 2006, 22: 2825–2827.
Jung Y, Oh M, Shin D, Kang S, Oh H: Identifying differentially expressed genes in meta-analysis via Bayesian model-based clustering. Biometrical Journal 2006, 48: 435–450.
Fung B, Ng V: Meta-classification of multi-type cancer gene expression data. Proceeding of 4th Workshop on Data Mining in Bioinformatics 2004, 31–39.
Jiang H, Deng Y, Chen H, Tao L, Sha Q, Chen J, Tsai C, Zhang S: Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes. BMC Bioinformatics 2004, 5: 81.
Conlon EM, Song JJ, Liu A: Bayesian meta-analysis models for microarray data: a comparative study. BMC Bioinformatics 2007, 8: 80.
Kim S, Dougherty ER, Barrera J, Chen Y, Bittner ML, Trent JM: Strong features sets from small samples. Journal of Computational Biology 2002, 9: 127–146.
17. Ghosh D, Chinnaiyan A: Classification and selection of biomarkers in genomic data using LASSO. J Biomed Biotechnol 2005, (2):147–154.
Gui J, Li H: Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics 2005, 21: 3001–3008.
Gui J, Li H: Threshold gradient descent method for censored data regression, with applications in pharmacogenomics. Pac Symp Biocomput 2005, 272–283.
Ma S, Huang J: Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatics 2005, 21: 4356–4362.
Ma S, Huang J: Clustering threshold gradient descent regularization: with applications to microarray studies. Bioinformatics 2007, 23: 466–472.
Zhang H, Ahn J, Lin X, Park C: Gene selection using support vector machines with non-convex penalty. Bioinformatics 2006, 22: 88–95.
Ma S, Huang J: Penalized feature selection and classification in bioinformatics. Briefings in Bioinformatics 2008, 9: 392–403.
Iacobuzio-Donahue CA, Ashfaq R, Maitra A, Adsay NV, Shen-Ong GL, Berg K, Hollingsworth MA, Cameron JL, Yeo CJ, Kern SE, Goggins M, Hruban RH: Highly expressed genes in pancreatic ductal adenocarcinomas: a comprehensive characterization and comparison of the transcription profiles obtained from three major technologies. Cancer Research 2003, 63: 8614–8622.
Logsdon CD, Simeone DM, Binkley C, Arumugam T, Greenson J, Giordano TJ, Misek D, Hanash S: Molecular profiling of pancreatic adenocarcinoma and chronic pancreatitis identifies multiple genes differentially regulated in pancreatic cancer. Cancer Research 2003, 63: 2649–2657.
Crnogorac-Jurcevic T, Missiaglia E, Blaveri E, Gangeswaran R, Jones M, Terris B, Costello E, Neoptolemos JP, Lemoine NR: Molecular alterations in pancreatic carcinoma: expression profiling shows that dysregulated expression of S100 genes is highly prevalent. Journal of Pathology 2003, 201: 63–74.
Friess H, Ding J, Kleeff J, Fenkell L, Rosinski JA, Guweidhi A, Reidhaar-Olson JF, Korc M, Hammer J, Buchler MW: Microarray-based identification of differentially expressed growth-and metastasis-associated genes in pancreatic cancer. Cell Mol Life Sci 2003, 60(6):1180–1199.
Grutzmann R, Boriss H, Ammerpoh O, Luttges J, Kalthoff H, Schackert H, Kloppel G, Saeger H, Pilarsky C: Meta-analysis of microarray data on pancreatic cancer defines a set of commonly dysregulated genes. Oncogene 2005, 1–10.
Bloomston M, Zhou J, Rosemurgy AS, Frankel W, Muro-Cacho CA, Yeatman TJ: Fibrinogen gamma overexpression in pancreatic cancer identified by large-scale proteomic analysis of serum samples. Cancer Research 2006, 66: 2592–2599.
Johnson SK, Dennis RA, Barone GW, Lamps LW, Haun RS: Differential expression of insulin-like growth factor binding protein-5 in pancreatic adenocarcinomas: identification using DNA microarray. Molecular Carcinogenesis 2006, 45: 814–827.
Crnogorac-Jurcevic T, Efthimiou E, Capelli P, Blaveri E, Baron A, Terris B, Jones M, Tyson K, Bassi C, Scarpa A, Lemoine NR: Gene expression profiles of pancreatic cancer and stromal desmoplasia. Oncogene 2001, 20: 7437–7446.
Crnogorac-Jurcevic T, Gangeswaran R, Bhakta V, Capurso G, Lattimore S, Akada M, Sunamura M, Prime W, Campbell F, Brentnall TA, Costello E, Neoptolemos J, Lemoine NR: Proteomic analysis of chronic pancreatitis and pancreatic adenocarcinoma. Gastroenterology 2005, 129(5):1454–1463.
Friedman J, Popescu BE: Gradient directed regularization. Technical Report, Stanford University Department of Statistics 2004.
This study has been supported by R01CA120988 and R03LM009828 from the NIH USA (SM and JH), and the CTSA award from Yale YCCI (SM). We would like to thank three referees for very insightful comments which have led to significant improvement of this article.
Both authors were involved in the study design, data analysis and writing. SM wrote the R code for data analysis. Both authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Ma, S., Huang, J. Regularized gene selection in cancer microarray meta-analysis. BMC Bioinformatics 10, 1 (2009). https://doi.org/10.1186/1471-2105-10-1
- Pancreatic Cancer
- Meta Analysis
- Gene Selection
- Meta Analysis Approach
- Gene Selection Method