Microarray studies provide a way of linking variations of phenotypes with their genetic causations. Constructing predictive models using high dimensional microarray measurements usually consists of three steps: (1) unsupervised gene screening; (2) supervised gene screening; and (3) statistical model building. Supervised gene screening based on marginal gene ranking is commonly used to reduce the number of genes in the model building. Various simple statistics, such as t-statistic or signal to noise ratio, have been used to rank genes in the supervised screening. Despite of its extensive usage, statistical study of supervised gene screening remains scarce. Our study is partly motivated by the differences in gene discovery results caused by using different supervised gene screening methods.

Results

We investigate concordance and reproducibility of supervised gene screening based on eight commonly used marginal statistics. Concordance is assessed by the relative fractions of overlaps between top ranked genes screened using different marginal statistics. We propose a Bootstrap Reproducibility Index, which measures reproducibility of individual genes under the supervised screening. Empirical studies are based on four public microarray data. We consider the cases where the top 20%, 40% and 60% genes are screened.

Conclusion

From a gene discovery point of view, the effect of supervised gene screening based on different marginal statistics cannot be ignored. Empirical studies show that (1) genes passed different supervised screenings may be considerably different; (2) concordance may vary, depending on the underlying data structure and percentage of selected genes; (3) evaluated with the Bootstrap Reproducibility Index, genes passed supervised screenings are only moderately reproducible; and (4) concordance cannot be improved by supervised screening based on reproducibility.

Background

Microarray techniques provide a way of monitoring gene expressions on a large scale. Biomedical experiments have been designed to discover important genes or gene pathways, that are linked with variations of phenotypes. Those genes can then be used as biomarkers in clinical studies and to construct predictive models in downstream analysis. Examples of such studies include disease classification studies in [1–3] and survival analysis in [4, 5], among many others.

Statistical analyses using gene expressions as covariates are very challenging due to high dimensionality of gene expression measurements and small sample sizes. Consider for example the Leukemia data [6], which is used as an example of binary classification in [7]. The data contains expression measurements of 6817 genes from 72 samples. We refer to [6] for experimental setup. A typical analysis, as presented in [7], consists of the following three steps.

1.

Data organization and unsupervised gene screening. In [7], this step consists of thresholding the raw measurements, filtering genes with small variations across all samples and logarithm transformation. 3571 genes pass the first stage screening. For other datasets, if severe missingness is present, simple data manipulation, such as filling in missing values, may also be needed.

2.

Supervised gene screening. Genes passed unsupervised screening are then ranked based on the ratio of their between-groups and within-groups sum of squares (referred as B/W hereafter). The 50 top ranked genes are selected for downstream statistical analysis. We note that the binary outcome is used in computing the B/W ratio.

3.

Predictive model building using the 50 selected genes. Various statistical methods, including classification tree, Fisher linear discriminant analysis and nearest neighbor approach, are used.

Similar three-step approaches have been extensively used. See for example, classification studies in [6, 8, 9] and survival analysis in [4, 10], among many others.

We now investigate this three-step procedure in more details. Steps 1 and 2 carry our gene screening, which is especially necessary under current "large p, small n" setting. The goal of the screening is three-fold: improving prediction performance by removing noninformative genes; providing faster and more cost-effective predictors; and providing a better understanding of the underlying causal relationships.

Step 1 is mainly due to technical concerns. For example, most statistical building methods in step 3 cannot handle missing data automatically, so we need to either remove genes with missing values or fill in with sample statistics. Under certain experimental setup, gene expression measurements above or below certain thresholds are not meaningful, so simple thresholding/flooring may be needed. Genes with little variations across samples are not likely to possess any biological functions of interest, so removing such genes may increase the signal to noise ratio. We note that in step 1 gene screening and data manipulation, information on the clinical outcome is not used. We hence refer it as the unsupervised gene screening.

In this article, step 2 screening is referred as the supervised gene screening. It differs from the unsupervised screening in the sense that the clinical outcome is used in gene screening. A typical supervised screening consists of the following steps:

(i)

Compute a marginal statistic for each individual gene. This statistic, for example the t-statistic in binary classification studies, is constructed using both the expression measurements and the clinical outcome.

(ii)

Rank genes based on their marginal statistics. For this purpose, although distribution of the marginal statistic does not need to be known, we do need to know the qualitative relationship between magnitude of this statistic and importance of corresponding gene, for example whether larger marginal statistics indicate more influential genes.

(iii)

Select the top ranked genes. We postpone discussion of how many genes need to be selected to the Discussions section.

In [7], the screening statistic is chosen to be the between-groups and within-groups sum of squares ratio, and the binary outcome is used to define the grouping. Although the distribution and other statistical properties of the B/W ratio do not need to be known, it is reasonable to say that genes with larger B/W can better predict the outcome and hence should be selected. Only the 50 top ranked genes are used in the statistical model building.

After gene screening in steps 1 and 2, predictive models can be constructed in step 3. Since the number of genes passed screening may still be much larger than the sample size, feature selection through regularization is usually needed along with estimation. Regularization methods used include partial least squares [8], LASSO [11], LASSO-LARS [12] and threshold gradient descent regularization (TGDR) [13], among many others. With the aforementioned regularized estimation approaches, only a small number of representative genes are included in the final models. The biological implications of those representative genes are of great scientific interest, and usually warrant more detailed investigation.

We note that not all three steps are needed in practical data analyses. Part of the screening can be omitted. For example, in the boosting study [14], all genes are used in the model building. In the above three-step procedure, step 1 is mainly due to technical concerns and is of less statistical interest. Step 3 has been intensively studied. See the aforementioned publications and references therein. However in previous studies, the supervised gene screening is usually taken for granted and has not been well investigated.

We first present a small numerical study to show the validity of the supervised screening. We consider the Colon and Leukemia data presented in the Results section as examples. If a gene screening method is valid, it should have certain reproducibility property. Especially, genes passed the screening in one subgroup should be similar to those screened in another independent subgroup. Since independent validation sets are usually not available, we consider the following bootstrap based approach.

1.

Randomly select 0.632n subjects, where n is the sample size.

2.

Select 20% genes using the chosen screening method.

3.

Repeat 1–2 1000 times.

4.

For each gene, compute the percentage of times it is included in the top 20% ranked gene lists.

The percentage computed here is closely related to the Bootstrap Reproducibility Index proposed below. We choose statistics 2, 5 and 8 (see the Methods section) as examples and show the percentage plot in Figure 1. Note that in Figure 1 we sort the genes based on decreasing percentages. We can see from Figure 1 that the percentages are far from being flat: there are some genes with very high percentages of being selected, which indicates that the supervised screening is reasonably reproducible. Studies with other screening statistics and other datasets show similar results and are omitted here.

A motivating example

This article is partly motivated by the following example. Consider the Leukemia data described in [6]. The data contains expression measurements of 6817 genes for 72 samples, among which 47 are ALL and 25 are AML. The clinical outcome of interest can be coded as a binary variable with the response equal to 1 if it is AML and 0 otherwise. We employ the same unsupervised screening as in [7] and 3571 genes pass the unsupervised screening.

For the supervised gene screening, we consider using the eight different ranking statistics listed in the Methods section to select the top 714 (20%) genes. In the statistical model building, we assume the commonly used logistic regression models, where the covariates are the 714 genes passed the unsupervised and supervised gene screenings and the outcome is the acute leukemia type. Since the sample size is much smaller than the number of covariates, we use the TGDR, which is capable of simultaneous estimation and gene selection, for regularized estimation. Estimation and gene selection using the TGDR has been studied in [13, 15]. The small number genes included in the final models are identified as important genes, and are concluded to be associated with the variations of phenotypes.

Eight possibly different sets of 714 genes passed the gene screenings are used to construct eight predictive models. Since the same unsupervised screening, the same logistic model and the same regularization method are used, differences (if any) in the eight final predictive models must be caused by the differences in the supervised screening. We show in Table 1 the number of genes included in the eight models constructed by logistic + TGDR, and their corresponding overlaps. For example, by using the difference of mean as the supervised screening statistic and fitting the logistic + TGDR model, 34 out of 714 genes are included in the final model; while 35 genes are included if the simple t-statistic is used as the supervised screening statistic. Only 22 genes are included in both models.

We can see from Table 1 that there exist considerable overlaps among genes included in the eight final models, showing moderate concordance of gene discovery results. However with different supervised screenings, the genes in the final logistic models may differ by more than 30%. A total of 66 genes are included in at least one of the eight final models, where as only 16 are included in all eight. More detailed gene discovery results are available upon request. We can conclude that for the Leukemia example, differences of the predictive models caused by differences in supervised gene screenings are significant and cannot be ignored. Similar results are observed for the Colon data and the Estrogen data.

Microarray studies like the Leukemia example have two main purposes. The first is to construct predictive models based on microarray measurements to guide future treatment selection. The second is to discover a small subset of genes that are accountable for variations of phenotypes. Identifying such influential genes may lead to better understanding of human genomics and new directions of gene therapy. From a scientific point of view, the second goal is at least as important as the first one. The Leukemia example shows that the effect of supervised gene screening, which has considerable effect on gene discovery results, warrants detailed investigation. Concordance problem in gene discovery by using different regularization methods has been studied. For example, several different gene discovery results have been reported for the diffuse large-B-cell lymphoma (DLBCL) data. See [16] for detailed discussions. In this article, we investigate the concordance in supervised gene screening, which has been neglected previously.

Another challenging aspect of microarray data analysis is the reproducibility. Empirical studies, for example [17], show that genes discovery results in one bootstrap sample are not necessarily reproducible in another one. Reproducibility in gene clustering is recently studied by [18]; Theoretical framework is established in [19, 20]; In addition [21] presents general reproducibility discussions of feature selection in microarray studies.

The goal of this study is two-fold. Firstly, concordance of different supervised gene screenings is investigated. Coupled with concordance study of regularized estimation methods, our study provides further insights into concordance of microarray gene discovery results using different statistical approaches. Secondly, reproducibility of individual genes in supervised screening is considered. The proposed Bootstrap Reproducibility Index provides more detailed reproducibility assessment than previous ones. We use empirical studies with four public microarray data to investigate the concordance and reproducibility. In the supervised screening, we follow the commonly used three-step procedure: computing (marginal statistics), ranking (based on those statistics) and selecting (top ranked genes). With slight abuse of terminologies, in the paper "different supervised screening methods" in fact means "supervised screening based on different marginal statistics". The rest of the paper is organized as follows. We discuss eight commonly used gene screening statistics in the Methods section and propose the Bootstrap Reproducibility Index (BRI). Empirical studies using four public datasets are provided in the Results section. Several related open questions are raised in the Discussions section. The article concludes with a short summary.

Results

Data Descriptions

Colon data

In this dataset, expression levels of 40 tumor and 22 normal colon tissues for 6500 human genes are measured using the Affymetrix gene chip. In the unsupervised gene screening, 2000 genes with the highest minimal intensity across samples are selected by [1]. The data is publicly available at [22].

Leukemia data

The leukemia dataset is described in [6] and available at [23]. This dataset comes from a study of gene expression in two types of acute leukemia: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Gene expression levels were measured using Affymetrix high density oligonucleotide arrays containing 6817 human genes. The data comprise 47 cases of ALL and 25 cases of AML. We use the same unsupervised screening as in [7]: (i) thresholding with floor of 100 and ceiling of 16000; (ii) filtering by excluding genes with max/min≤5 and max-min≤500, where max and min refer to the maximum and minimum expression levels of a particular gene across samples, respectively; and (iii) base 2 logarithm transformation. 3571 genes pass the unsupervised screening.

Estrogen data

This dataset was first presented in [2, 3]. It contains expression values of 7129 genes from 49 breast tumor samples. The expression data were obtained using the Affymetrix gene chip technology and are available at [24]. For the estrogen data, there are two different response variables available. The first one describes the status of the estrogen receptor (ER). 25 samples are ER+, whereas the remaining 24 samples are ER-. The second response variable describes the lymph nodal (LN) status, which is an indicator for the metastatic spread of the tumor. Here 24 samples are positive (LN+) and 25 samples are negative (LN-). We consider the same gene expression data coupled with the ER and LN outcomes, respectively, and refer them as Estrogen-ER and Estrogen-LN hereafter. For the unsupervised screening, we threshold the raw data with a floor of 100 and a ceiling of 16000. Genes with max/min ≤ 5 and/or max - min ≤ 500 are also excluded. 5146 genes pass the unsupervised screening. A base 2 logarithmic transformation is then applied.

Empirical study I: concordance

In the first empirical study, we consider concordance of gene sets passed the eight different supervised screening approaches. For the four datasets, 20% (40%, 60%) top ranked genes pass the supervised screening. We show the concordance results in Tables 2, 3, 4 (left panels), where we compute the relative fractions of overlapped genes between any two screening methods. For example for the Colon data in Table 2, we first rank the 2000 genes using the difference of mean (statistic 1) and the t-statistic (statistic 2) as marginal statistics. Then the top 400 genes (20%) are selected under each approach separately. 348 genes are identified by both methods, leading to 87% overlap.

We can observe that there are considerable overlaps between genes selected under different supervised screening methods. However, the overlaps are not perfect and the differences can be as large as 40% (Table 2; Estrogen-LN and Estrogen-ER). As m/d increases, the relative fractions of overlaps also increase, which suggests higher degree of concordance. However, even if 60% of genes pass the screening, the concordance may still be as low as ~75% for the Estrogen-LN data. Similar results are observed for the Lymphoma data [25], the NCI 60 data [26] and others. Empirical study I reveals that with commonly used supervised screenings, genes selected under different supervised screenings may be considerably different.

Empirical study II: reproducibility

Reproducibility of the supervised screening can be assessed with the proposed BRI. As stated in the Background section, only genes passed the supervised screening are used in statistical model building. So it is of great interest to see whether those genes are reproducible. Although the proposed BRI can measure the reproducibility of individual genes, we only present summary statistics (median and inter-quartile range) as the overall measurements of reproducibility for those selected genes. Results are shown in Table 5.

We can see that in general genes screened by the eight different methods are moderately reproducible. For example for Colon data when 20% genes are selected in the supervised screening, the medians of BRI are ~0.7 or 0.8, which roughly means that for the 400 genes passed supervised screenings, on average they pass the corresponding supervised screenings in 70% to 80% of the bootstrap samples. Another observation is that as m/d increases, the BRIs also increase. We can also see that the BRIs for one fixed data and different screening methods can be slightly different, which is believed to be caused by the underlying gene distributions. We note that our reproducibility results are better than those shown in [17]. This is caused by the small m/d in [17], where m = 50 and d > 4000.

Empirical study III: concordance of screening based on BRI

In [19], it is suggested that supervised gene screening should be based on reproducibility, i.e., instead of using marginal statistics based on all observations, a stability index should be computed for each gene based on certain marginal statistics and bootstrap random samples; genes then can be ranked based on this stability index; top ranked genes pass the supervised gene screening. Theoretical and empirical studies [19] show that predictive models can be more powerful if genes are screened based on reproducibility.

The focus of our study is the concordance of different supervised screening methods, instead of the predictive model building. However, it is of interest to see if statement in [19] can be extended to supervised screening, i.e, if concordance of supervised screening can be improved if genes are screened based on reproducibility measurement. We consider the following empirical study. For each gene screening statistic, we (1) compute the BRIs for all genes based on bootstrap samples; (2) rank the genes based on the BRIs; (3) identify the genes with the highest BRIs; and (4) compute the concordance between gene sets identified in (3). Compared with empirical study I, the same eight marginal statistics are used. However, in empirical study I, we compute the marginal statistic once for each gene, and the computation is based on all observations. In empirical study III, the marginal statistics are computed multiple times: once for each bootstrap sample. The statistic used to rank the genes is the BRI.

We show in Tables 2, 3, 4 (right panels) the concordance results if genes are screened based on the BRI. We can see that the results are very similar to those shown in the left panels. We do not observe significant improvement of concordance by using the BRI for screening. This observation can be partly explained by the fact that genes screened in empirical studies I and III are almost identical. For example for the Colon data when 20% genes are selected (Table 2), about 95% of genes passed the screening in empirical study I are also selected in empirical study III. Similar high overlaps are observed for other datasets.

Although further theoretical investigation is still needed, our empirical study leads to the conclusion that supervised screening based on reproducibility measurement cannot improve the concordance.

Discussion

Remark: how to choose supervised screening methods

Empirical studies above show that the effect of supervised screening on predictive model building is not ignorable. See Table 1 for example. Our study focuses on concordance and reproducibility measurements. However, we note that our study does not lead to any recommendations on how to choose the supervised screening methods. Such a question still remains open. Theoretically speaking, validity (in terms of consistent gene selection) of supervised gene screening depends on the unknown underlying model and data distribution. For practical data analysis, universally optimal supervised screening method is not expected to exist.

Although screening based on marginal statistics has been extensively used, recent studies [27, 28] show that supervised gene screening should also consider the correlation structures among genes, and marginal methods may not be optimal. In addition, [21] shows that a gene that is 'useless" by itself can be helpful in the joint models. Empirical study of adaptive selection of supervised screening method based on reproducibility will be studied in a later article.

Remark: how many genes should be selected

Our empirical studies show that as m/d increases, the concordance and reproducibility measurements of all eight screening methods increase. So from concordance and reproducibility point of view, larger m is preferred. However with larger m, more genes, including noisy genes, pass the supervised screening. This contradicts the noise-removal and model-reduction purposes of the supervised screening. The predictive models are expected to be less reliable, when more genes are used in the model building. So the number of genes passed supervised screening should balance between the concordance and reproducibility requirement and the predictive model building.

Theoretical studies [29] show that the number of genes can at most be in the order of exp(n) for a given sample size n. Such results provide an asymptotic guideline for determining the number of genes. However, to our best knowledge, there is no theoretical or empirical studies targeting the optimal choice of m for practical datasets with small n.

Remark: effect of unsupervised screening

In our empirical study, unsupervised screening is carried out for all four datasets. Our unsupervised screening follows [1] for Colon data and [7] for Leukemia and Estrogen data. It is believed that different unsupervised screenings will affect supervised screening and predictive model building results. Unfortunately, as for supervised screening, there has not been enough study of unsupervised screening. Previous used unsupervised screenings are case-specific and usually depend on the actual experimental settings. Without accessing the experimental setup and interacting with the original researchers, we are not able to provide an honest assessment of the unsupervised screening in our study. We refer to aforementioned publications for rationale of the specific unsupervised screening.

Remark: connections with detection of differentially expressed genes

In simple microarray settings such as the Apo AI study in [30], the goal is to detect genes differentially expressed. For binary classification problem such as the Colon data, we can also lower our goal from statistical model building to detection of genes that are differentially expressed between diseased and healthy subjects. If so, then all the eight statistics discussed in last section can be used to rank and detect differentially expressed genes. The reproducibility of such studies has been investigated in [31] and references therein.

Although in our study, supervised screening and detection of differential genes are nearly identical, in general they can be significantly different in the following sense. Firstly, detection of differential genes is usually under the simple setting with two sub-populations. The two sub-populations may come from different experimental settings and it is not always reasonable to assume statistical models linking gene expressions with the outcome, i.e, the causality does not necessarily exist. Supervised screening is used before a statistical model can be built. It is employed in much more general settings, for example in survival analysis or longitudinal studies. Secondly, supervised screening can be based on reproducibility. This has been proposed and proved to function. However, it is not clear whether reproducibility measurement can be used in differential gene detection. Thirdly, when defining differentially expressed genes, certain statistical properties of the marginal statistics need to be known. For example for genes not differentially expressed, the p-values for t-statistics are uniformly distributed. With supervised screening, we only need to rank the genes and the top ranked are selected. Hence only minimum properties of the ranking statistic need to be known. Fourthly, in detection of differential genes, the correlation structure of the genes has significant effect on the distribution of marginal statistics and hence detection results. See [32] for discussions. In the supervised screening, the distribution of the ranking statistics is of less interest: only the relative ranking of those (possibly correlated) statistics is used. So as long as the marginal distributions are fixed, the correlation structures among genes have no effect on the supervised screening.

Conclusion

In microarray studies, supervised gene screening is usually carried out before statistical model building. In this article, we investigate the concordance and reproducibility of supervised screening via empirical studies. Our study leads to the following conclusions: (1) effect of supervised screening on predictive model building and gene discovery should not be ignored. Explanations of gene discovery results should be with extra cautions, if supervised screening is used; (2) genes passed different supervised screenings can be considerably different. The concordance depends on the screening statistics, underlying data structures and number of genes selected; (3) as measured by the BRI, genes passed supervised screenings are only moderately reproducible and the reproducibility also depends on the number of genes selected; and (4) supervised screening based on reproducibility cannot improve concordance.

The goal of this study is to provide empirical evidence for the concordance and reproducibility problems in supervised gene screening, which has not been detailed studied previously. Several related questions still remain open and are listed in last section. Although of great importance, they are beyond the scope of this article.

Methods

Supervised screening statistics in binary classification

Clinical outcomes being considered in microarray studies include categorical outcome (presence or absence of disease; different stages of disease), censored survival outcome (occurrence time of disease related event) and continuous outcome (value of disease biomarker), among others. Although unsupervised and supervised gene screenings are needed for analyses of all aforementioned outcomes, we first focus on binary outcome because of its popularity and simplicity.

Denote the outcome of interest as Y, where subjects with Y = 1 are referred as diseased and otherwise healthy. Denote X as the length d vector of gene expressions. In addition, denote X^{D} and X^{H} as gene expressions for diseased and healthy subjects, respectively. We assume there exist n i.i.d observations with n^{D} diseased and n^{H} healthy subjects and n^{D}+ n^{H} = n. For gene j = 1,...,d, denote {\overline{X}}_{j}^{D} and {\overline{X}}_{j}^{H} as the sample means of gene expressions for diseased and healthy subjects, respectively. Denote {\widehat{\sigma}}_{j}^{D} and {\widehat{\sigma}}_{j}^{H} as the corresponding sample standard deviations. Denote T_{
j
}, j = 1...,d as the marginal statistics that are used to rank and screen genes. The following ranking statistics have been extensively used in previous studies.

1.

Difference of mean. The statistic for the j^{th} gene is defined as T_{
j
}= |{\overline{X}}_{j}^{D} - {\overline{X}}_{j}^{H}|. Top ranked genes have large T_{
j
}. Using the difference of mean as ranking criteria has been investigated in [20] and references therein.

2.

Simple t-statistic. For each gene, we first compute the pooled variance estimate as {\widehat{\sigma}}_{p}^{2} = {(n^{D} - 1){\widehat{\sigma}}_{j}^{D2} + (n^{H} - 1){\widehat{\sigma}}_{j}^{H2}}/(n^{D} + n^{H} - 2). The t-statistic is defined as T_{
j
}= ({\overline{X}}_{j}^{D} - {\overline{X}}_{j}^{H})/\widehat{\sigma}_{
p
}. A larger absolute value of T_{
j
}leads to higher rank. For binary classification, supervised screening using t-statistic is equivalent to using the correlation coefficient [21] and the B/W ratio [7]. The correlation coefficient can be used when the outcome is a continuously distributed variable. In that case, the simple t-statistic cannot be directly employed. When the outcome is categorical with more than two levels, the B/W can be directly used whereas the t-statistic can be modified to an F-type statistic.

3.

Signal to noise ratio [33]. The statistic is defined as {T}_{j}=|{\overline{X}}_{j}^{D}-{\overline{X}}_{j}^{H}|/\sqrt{{\widehat{\sigma}}_{j}^{D2}+{\widehat{\sigma}}_{j}^{H2}}. Interestingly, using the signal to noise ratio for binary classification yields the same ranking as using the binormal AUC [9], which is the ranking criteria from the binormal ROC method. We note that both the signal to noise ratio and the binormal AUC have been extensively used and can be extended to much more general cases.

4.

SAM (Significance Analysis of Microarray) with fudge factor. Ranking based on the SAM statistic has been extensively used. For discussion, see [34, 35]. The SAM statistic is modified from the simple t-statistic and defined as T_{
j
}= ({\overline{X}}_{j}^{D} - {\overline{X}}_{j}^{H})/(\widehat{\sigma}_{
p
}+ f), where f is the positive fudge factor. In our study, we simply set f = median(\widehat{\sigma}_{
p
}). Data-adaptive fudge factor selection has been discussed in [35].

5.

Wilcoxon rank-sum statistic. For gene j, the expression levels are ranked first. Denote {R}_{j}^{D} as the sum of ranks for the expression levels of diseased subjects. The Wilcoxon statistic is computed as {T}_{j}={n}^{D}{n}^{H}+\frac{{n}^{D}({n}^{D}+1)}{2}-{R}_{j}^{D}.

6.

Kolmogorov-Smirnov statistic. For gene j, nonparametric estimates of the gene expression distribution functions for diseased and healthy subjects are separately computed. The KS statistic is defined as the maximum distance between the two estimated distributions.

7.

Estimated coefficient from marginal logistic regression. We first normalize all genes to have unit variances. The marginal logistic regression for the j^{th} gene assumes logit(E(Y = 1|X_{
j
})) = α_{
j
}+ β_{
j
}X_{
j
}, with unknown intercept α_{
j
}and regression coefficient β_{
j
}. T_{
j
}is set as the absolute value of the maximum likelihood estimate of β_{
j
}.

8.

P-value from marginal logistic regression. T_{
j
}is set as the p-value corresponding to the estimate of β_{
j
}from the logistic model in 7.

Statistics 1–4 are related to the simple t-statistic, which is a parametric statistic based on the normal or asymptotic normal distribution assumption. The difference of mean can be obtained from t-statistic by assuming equal variances across genes. If we ignore the difference between the number of diseased and healthy subjects, t-statistic can be simplified as the signal to noise ratio. The SAM statistic is a simple modification of the t-statistic, mainly due to the concern that extremely large t-statistic may be caused by small sample size and hence small variance estimate. The fudge factor is supposed to pull the extreme variance estimates towards their average. The Wilcoxon and Kolmogorov-Smirnov statistics are nonparametric. They rely on less assumptions on the marginal distributions than the t-statistic. However, the drawback is that nonparametric statistics may not be powerful enough, especially for microarray data when the sample size is very small. Statistics 7 and 8 are based on marginal logistic models, which are the most commonly assumed robust models in binary classification. Such model based methods are less common in classification study, but very popular in survival analysis [4] and linear regressions.

Loosely speaking, all the eight supervised screening statistics are designed to test the marginal hypothesis H_{0j}: E({X}_{j}^{D}) = E({X}_{j}^{H}). If we assume that the gene expressions for the diseased and healthy subjects are both normally distributed with the same variance but different means, then for fixed d and n → ∞, all the eight screening methods can consistently identify differentially expressed genes and hence properly screen genes. Thus the eight different screening methods are valid and the ranking/screening results should be concordant asymptotically. However, for gene expression data with small sample sizes, the normal distribution assumption may not be satisfied. In addition, as shown in [29], even when the normal distribution assumption is satisfied, finite sample performances of different screening statistics may still be considerably different.

Concordance measurement

We consider the following concordance measurement for gene sets passed different supervised screenings. Assume that we select m top ranked genes based on different supervised screening statistics. For any two sets, concordance is measured by the percentage of overlap. Beyond depending on the underlying data structures and the ranking statistics used, the proposed concordance measurement also depends on the ratio of m/d, as shown in the Results section. For example in the extreme case of m/d ~ 1, almost all genes are selected and concordance is close to 1 for any screening methods. In our empirical studies, we consider three different m/d ratios.

An alternative concordance measure is the preservation degree of ranking based on two different ranking statistics. Studies using such concordance measurement can be found in [31]. This is the proper measure of concordance if ranking of the selected genes is of interest, for example in study of detecting differentially expressed genes where higher ranked genes warrant more detailed studies. In supervised gene screening, the purpose of ranking and screening is to provide a set of working genes for downstream model building. So the relative ranking of the screened genes is of less interest. For this reason, the proposed concordance measure is proper.

Bootstrap Reproducibility Index

Genes that are more reproducible carry stronger and more stable information of the causal relationship. In [19], it is proposed that gene ranking and screening can be based on reproducibility, where reproducibility is an overall measurement evaluated based on the number of overlapped genes among bootstrap samples. Although having sound theoretical basis, such reproducibility measurement focuses on the overall reproducibility of ranking/grouping methods instead of the reproducibility of individual genes.

Motivated by aforementioned studies, we consider the following Bootstrap Reproducibility Index (BRI), which shares the same spirits as the occurrence index measurement proposed in [36]. The BRI is computed as follows.

1.

Randomly sample n_{1} subjects from the n observations without replacement. In our study, we propose using n_{1} ~ 0.632 × n.

2.

For each bootstrap sample and a fixed gene screening method, compute the marginal statistics {T}_{j}^{\ast} for gene j = 1,..., d. Select the m top ranked genes.

3.

Repeat steps 2 and 3 B = 1000 times.

4.

For gene j, compute O_{
j
}: the number of times this gene is included in the m top ranked genes out of the B bootstrap samples.

5.

The BRI for gene j under the chosen screening statistic is defined as BRI_{
j
}= O_{
j
}/B.

We generate random bootstrap samples in step 1. In the ideal case, reproducibility should be evaluated using independent samples. Since such samples usually do not exist, we create random samples using bootstrap. The "0.632" bootstrap without replacement is investigated in [37]. The rationale is that if we sample n subjects with replacement, then the expected number of unique observations is 0.632n. Bootstrapping and screening are iterated in step 3. The BRI computed in step 5 is defined as the relative fraction of occurrence.

For a specific subset of genes, we can compute simple summary statistics, for example mean or median of the individual BRIs, as the overall reproducibility measurement. Especially, since genes passed the supervised screening will be used in downstream model building, reproducibility of those genes are of special interest. In our empirical studies, we compute the median and inter-quartile range of BRI for those genes.

The proposed BRI is closely related to the reproducibility measurement in [19]. If the reproducibility measurement in [19] is high, then the rank of genes is preserved across bootstrap samples, which means that a subset of genes rank high in most bootstrap samples. Since those genes will pass the supervised screening in most bootstrap samples, they also have large BRIs.

Supervised screening in survival analysis and linear regression

Supervised gene screening is also needed for analysis of survival type and continuous outcomes. In such studies, previously proposed supervised screenings are often model based. For example in [4], marginal Cox models using the survival outcome and individual gene expressions as covariates are first fit. Supervised screening statistic is chosen as the p-value from the marginal Cox model. For survival type outcome, an alternative approach is to consider each time point separately. Then the event indicator, which is binary, can be used as the outcome. The screening statistics for binary outcome listed above can then be adopted. One example is the time-dependent ROC approach [38], which is extended from the ROC method for binary classification. The time-integrated AUC can be used as the screening statistic.

For continuous outcomes, marginal linear models can be fit, and the ranking statistic can be chosen as the p-value or the actual value of the regression coefficient. Correlation coefficient, which is used in binary classification, is also applicable for data with continuous outcomes. Another alternative approach is to dichotomize continuous outcomes and create categorical variables. Then screening statistics for binary classification can be adopted. We postpone investigation of supervised screening with survival and continuous outcomes to a future study.

References

Alon U, Barkai N, Notterman D, Gish K, Mack S, Levine J: Broad Patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays.PNAS 1999, 96: 6745–6750. 10.1073/pnas.96.12.6745

Spang R, Blanchette C, Zuzan H, Marks J, Nevins J, West M: Prediction and uncertainty in the analysis of gene expression profiles.Proceedings of the German Conference on Bioinformatics GCB 2001.

West M, Blanchette C, Dressmna H, Huang E, Ishida S, Spang R, Zuzan H, Olson J, Marks J, Nevins J: Predicting the clinical status of human breast cancer by using gene expression profiles.PNAS 2001, 98: 11562–11467.

Dave SS, Wright G, Tan B, Rosenwald A, Gascoyne RD, Chan WC, Fisher RI, Braziel RM, Rimsza LM, Grogan TM, Miller TP, LeBlanc M, Greiner TC, Weisenburger DD, Lynch JC, Vose J, Armitage JO, Smeland EB, Kvaloy S, Holte H, Delabie J, Connors JM, Lansdorp PM, Ouyang Q, Lister TA, Davies AJ, Norton AJ, Muller-Hermelink HK, Ott G, Campo E: Prediction of survival in follicular lymphoma based on molecular features of tumor-infiltrating immune cells.The New England Journal of Medicine 2004, 351: 2159–2169. 10.1056/NEJMoa041869

Nguyen D, Rocke DM: Partial least squares proportional hazard regression for application to DNA microarray data.Bioinformatics 2002, 18: 1625–1632. 10.1093/bioinformatics/18.12.1625

Ghosh D, Chinnaiyan AM: Classification and selection of biomarkers in genomic data using LASSO.Journal of Biomedicine and Biotechnology 2005, 2: 147–154. 10.1155/JBB.2005.147

Gui J, Li HZ: Penalized Cox Regression Analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data.Bioinformatics 2005, 21: 3001–3008. 10.1093/bioinformatics/bti422

Ma S, Huang J: Regularized ROC method for disease classification and biomarker selection with microarray data.Bioinformatics 2005, 21: 4356–4362. 10.1093/bioinformatics/bti724

Guyon I, Elisseeff A: An introduction to variable and feature selection.Journal of Machine Learning Research 2003, 3: 1157–1182. 10.1162/153244303322753616

Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson JJ, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM: Distinct types of diffuse large B-Cell lymphoma identified by gene expression profiling.Nature 2000, 403: 503–511. 10.1038/35000501

Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, Iyer V, Jeffrey SS, VandeRijn M, Waltham M, Pergamenschikov A, Lee JC, Lashkari D, Shalon D, Myers TG, Weinstein JN, Botstein D, Brown PO: Systematic variation in gene expression patterns in human cancer cell lines.Nature Genetics 2000, 24: 227–234. 10.1038/73432

Ng V, Breiman L: Bivariate variable selection for classification problem.Technical report, Department of Statistics, University of California-Berkeley 2005.

Yang YH, Dudoit S, Luu P, Speed TP: Normalization for cDNA Microarray Data.Microarrays: Optical Technologies and Informatics, Vol. 4266 of Proceedings of SPIE 2001, 141–152.

Qiu X, Xiao Y, Gordon A, Yakovlev A: Assessing stability of gene selection in microarray data analysis.BMC Bioinformatics 2006., 7(50):

Qiu X, Brooks A, Klebanov L, Yakovlev A: The effects of normalization on the correlation structure of microarray data.BMC Bioinformatics 2005., 6(120):

Lai C, Reinders MJT, Wessels LFA: Multivariate gene selection: Does it help?IEEE Computational Systems Biology Conference, Stanford 2005.

The author thanks Dr. Jian Huang for insightful discussions, and two referees for very valuable suggestions that have led to significant improvement of this paper. The author would like to thank Yale Center for High Performance Computation in Biology and Biomedicine (NIH grant: RR19895-02) for computational support.

Author information

Authors and Affiliations

Department of Epidemiology and Public Health, Yale University, New Haven, CT, 06520, USA

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.