Re-sampling strategy to improve the estimation of number of null hypotheses in FDR control under strong correlation structures
© Lu and Perkins; licensee BioMed Central Ltd. 2007
Received: 15 November 2006
Accepted: 18 May 2007
Published: 18 May 2007
When conducting multiple hypothesis tests, it is important to control the number of false positives, or the False Discovery Rate (FDR). However, there is a tradeoff between controlling FDR and maximizing power. Several methods have been proposed, such as the q-value method, to estimate the proportion of true null hypothesis among the tested hypotheses, and use this estimation in the control of FDR. These methods usually depend on the assumption that the test statistics are independent (or only weakly correlated). However, many types of data, for example microarray data, often contain large scale correlation structures. Our objective was to develop methods to control the FDR while maintaining a greater level of power in highly correlated datasets by improving the estimation of the proportion of null hypotheses.
We showed that when strong correlation exists among the data, which is common in microarray datasets, the estimation of the proportion of null hypotheses could be highly variable resulting in a high level of variation in the FDR. Therefore, we developed a re-sampling strategy to reduce the variation by breaking the correlations between gene expression values, then using a conservative strategy of selecting the upper quartile of the re-sampling estimations to obtain a strong control of FDR.
With simulation studies and perturbations on actual microarray datasets, our method, compared to competing methods such as q-value, generated slightly biased estimates on the proportion of null hypotheses but with lower mean square errors. When selecting genes with controlling the same FDR level, our methods have on average a significantly lower false discovery rate in exchange for a minor reduction in the power.
Microarray technology has become a standard experimental method in bio-medical research. In the analysis of microarray data, one of the most fundamental tasks is the identification of differentially expressed genes while controlling false positives and minimizing false negatives. This is a multiple hypothesis test problem which analyzes thousands or tens of thousands of genes simultaneously. In these tests we often need to control the false discovery among the rejected hypotheses under a pre-specified level while maintaining maximal power. Thus, there is a trade off in the control of the type-I error between rejecting true null hypotheses (false discovery) versus accepting true alternative hypotheses (false negative).
Traditional Bonferroni correction procedures are designed to control the Family Wise Error Rate (FWER), which guards against making one or more type I errors among a family of hypothesis tests. However, these procedures may be excessively conservative for microarray analysis where the number of hypotheses is very large and a substantial fraction of the genes are differentially expressed . A more appropriate approach is to control the False Discovery Rate (FDR), which is the proportion of type I errors among all rejected hypotheses [2, 3]. This approach is particularly useful in exploratory analyses, where the objective is to maximize the discovery of true positives, rather than guarding against one or more false positive results.
A number of methods have been proposed to control the FDR given a population of hypothesis tests. These methods usually assume that the distribution of the test statistics, f, can be modeled by a mixture of two components :f(x) = π0f0(x) + (1 - π0)f1(x) π0 = m0/m
Where f0 is the distribution of the test statistics under H0, which by definition equals to 1 when using p-values when tests are independent, f1 is the distribution of the test statistics under H1, m0 is the number of true H0, m is the total number of hypotheses under consideration, and π0 is the proportion of true H0. The methods proposed by Benjamini et al [2, 3] to control FDR do not estimate π0; therefore, they provide the strongest controls on FDR but have the lowest power compared to other methods that do so.
In many actual applications where a considerable number of genes are differentially expressed, assuming π0 = 1 may be too conservative causing loss of power. Several alternative methods, such as nonparametric empirical Bayesian pFDR criterion and its p-value equivalent called q-value method [1, 5, 6], bin-wise model [7–9], local FDR method , parametric beta-uniform mixture models [11–14], the Lowest Slope estimator (LSL) , the Spacing LOESS Histogram (SPLOSH) method , the nonparametric MLE method , the moment generating function approach , and the Poisson regression approach [18–20], have all been proposed to estimate π0 by pooling test statistics and controlling FDR based on the estimated π0.
In these methods, one of the critical steps is estimating the proportion of null hypotheses, π0. When using p-values, these estimations usually depend on the assumption that f0 follows a uniform distribution. This assumption, which is of critical importance for the methods of statistical inference that employ pooling test statistics across genes , is valid when all test hypotheses are independent and identically distributed. Furthermore, when there are only weak correlations, or "clumpy" correlations (a large number of groups that have a small number of genes with high correlation within groups but no correlation between groups [21, 22]), the uniform assumption is not strongly violated and the method remains adequate. However, in datasets with large scale strong correlations, the joint distribution of the test statistics will no longer be the product of marginal distributions, and the observed f0 will severely deviate from uniform, causing the current π0 estimation methods to become very unstable. Increased variation and bias of π0, as well as FDR, was also observed by Wu et al  in datasets with strong local correlations.
The effect of correlation on simultaneous significance tests was previously discussed theoretically [23–25], and a number of permutation based FDR control methods were proposed, such as SAM , dChip , Ge et al , Meinshausen et al  and Efron . In these methods, the distribution of f0 was modeled empirically through permutations, which naturally considered the correlation. However, like Benjamini et al [2, 3], these methods don't estimate π0; therefore, in datasets with a large number of differentially expressed genes, the FDR control may be overly conservative with a loss of power.
Therefore we proposed 2 re-sampling schemes, similar to model averaging in bagging methods, to reduce the variation in estimating π0 in datasets with strong correlation between gene expression values. Our methods produced a more stable and conservative estimation of π0 and, therefore, provided stronger control of False Discovery Rate with only a minor sacrifice of power.
Creating simulated data set
To test the performance of various algorithms in estimating π0, we generated 2 types of simulated datasets. Both datasets had strong correlation between subsets of genes and a known proportion of true null hypotheses, to represent the log transformed microarray expression data.
The first simulation method is similar to the method used in Qiu et al  and Wu et al . Assume n samples and m genes, with n/2 samples per class. The m genes were divided into K blocks with each block consisting of m/K genes. Assume independence between blocks and constant correlation coefficient between genes within each block. For block l and sample j, we first created a block center vectorb lj = d l ·x j + δ lj , l = 1,...,K, j = 1,...,n
where d l was the mean difference between groups, it equals to 0 (for simulating non-differentially expressed genes) with probability π0, or was generated from beta distribution with parameter (4, 20) otherwise; x j was a group indicator; and δ lj was i.i.d. N(0,1) to represent sample specific noise. Then the expression value of gene i in sample j in that block, Y lji , was generated byY lji = ρ·b lj + (1 - ρ)·e lji , i = 1,...,m/K
where ρ was the correlation constant which takes value between [0,1] determining the correlation coefficient between genes within the block, and e lji was i.i.d. N(0,1). The K blocks of genes were generated independently of each other and then pooled to form the whole dataset. We call this type of dataset which contains blocks of highly correlated genes Data-B in our experiments.
The above Data-B model is over-simplified in many aspects, and is still a "clumpy" structure, although the clumpiness can be pronounced. To mimic more realistic situations, we generated a second type of simulated data based on an actual human breast cancer microarray dataset  obtained with Affymetrix U133 plus 2.0 microarrays. The dataset contains 65 estrogen receptor positive (ER+) cases and 46 estrogen receptor negative (ER-) cases. The data were normalized by the GCRMA algorithm[30, 31], and the gene (probe-set) expression levels were log2-transformed. According to published literatures [32, 33], the ER status is one of the most predominant partitioning factors for molecular classification of breast cancer. We therefore took some of the genes differentially expressed between the two classes as "truly H1" genes. We selected 8778 genes with differences in mean of log transformed expression levels between the two classes greater than 0.58 (equivalent to a 1.5 fold change). From these genes, each time we randomly chose 1000 to form our simulated dataset and then randomly picked π0 proportion of the 1000 genes to establish H0 genes by scrambling these genes together between samples. Thus, we obtained a simulated dataset with a known number of null hypotheses while the correlations among both the differentially and non-differentially expressed genes were maintained. We call this type of dataset Data-M in our experiments .
Estimating π0by re-sampling strategy
To get a better estimation of π0 in datasets with strong correlations, we proposed 2 re-sampling methods to replace the original π0 estimation step in q-value method which estimates π0 directly from the p-value distribution .
The first method, termed SampS, re-samples without replacement 2/3 of the samples from each class, calculates p-values for each gene, and then estimates the π0 from the p-values. For each dataset, we performed this procedure 100 times and used the upper quartile to replace the π0 estimated by the q-value method.
The second method, termed SampG, re-samples without replacement 2/3 of the values from each class for each gene independently, calculates the p-values for each gene, and then estimates the π0. We also performed this procedure 100 times for each dataset and used the upper quartile to replace the π0 estimated by the q-value method.
After each re-sampling step, we have to feed the p-values into a π0 estimator. This estimator could be the π0 estimation by q-value method, the mgf method , or any other unbiased π0 estimators. In this paper, we tried to use both q-value and mgf as the plug-in estimator, and called them SampS.q and SampG.q when using q-value as the plug-in estimator, or SampS.m and SampG.m when using mgf as the plug-in estimator.
Variation of the π0estimation when genes are correlated
Comparing Figure 2, Figure 3 and Figure 4, it is apparent that increased correlation among genes greatly increased the deviation of p-value distribution from uniform. Therefore, strong correlation structures will increase the variation in estimated π0. And inevitably subsequent statistical inferences and false discovery control procedures would be influenced by this unstable π0 estimation.
Improving the π0estimation by re-sampling strategies
To improve the estimation of π0 in datasets with strong correlations, we proposed two methods, termed SampS and SampG, to replace the original π0 estimation step in the q-value method. In the SampS algorithm, we used a model averaging strategy. We repeatedly sampled 2/3 of the data from each class without replacement, calculated the p-values for genes, estimated π0 from the p-value distribution and finally used the upper quartile of the re-sampling estimations in the subsequent statistical inferences. In the SampG algorithm, we further broke down the correlations between genes and stabilized the π0 estimation. In this algorithm, we repeatedly sampled without replacement 2/3 of the expression values from each class independently for each gene, calculated their p-values, estimate π0, and then used the upper quartile of the re-sampling estimations in subsequent analysis. For the choice of the plug-in π0 estimator, we tried to use both q-value  and moment generating function  methods, and called them SampG.q, SampS.q and SampG.m, SampS.m, respectively.
To test the variation in π0 estimation induced by strong correlation structures, and the performance of our proposed SampS and SampG methods, we created simulated datasets Data-B and Data-M, with true π0 being 0.9, 0.8, 0.7, 0.6, and the correlation constant ρ being 0.3, 0.5 and 0.7, respectively. We created 100 datasets for each combination of these control parameters. In Data-B, we created 1000 genes forming 10 blocks with 100 genes per block to simulate large scale correlation between genes. In Data-M we randomly selected 1000 genes with differences in the mean of log transformed expression levels greater than 0.58, then randomly scrambled 90%, 80%, 70% or 60% of them to create datasets with known proportion of true null hypothesis and strong gene-gene correlations. We applied the SampS and SampG strategies to estimate π0 for each of the simulated datasets, and compared our results to a number of other methods.
q-value method with tuning parameter chosen by smoother method (Qvalue) ; function qvalue
The Spacing LOESS Histogram (SPLOSH) ; function splosh
The beta-uniform mixture model (BUM) ; function bum.mle
The nonparametric MLE method (convest); function convest
The moment generating function (mgf) ; function p0.mom
SampG method, with q-value plug in, 2nd quartile (SampG.q Q2)
SampG method, with q-value plug in, 3rd quartile (SampG.q Q3)
SampS method, with q-value plug in, 2nd quartile (SampS.q Q2)
SampS method, with q-value plug in, 3rd quartile (SampS.q Q3)
SampG method, with mgf plug in, 2nd quartile (SampG.m Q2)
SampG method, with mgf plug in, 3rd quartile (SampG.m Q3)
SampS method, with mgf plug in, 2nd quartile (SampS.m Q2)
SampS method, with mgf plug in, 3rd quartile (SampS.m Q3)
Mean Square Error of π0 estimation by different methods
Table 1 listed the MSE of π0 estimations by various methods. To better understand the methods tested, we listed both the 2nd and 3rd quartiles of the SampS and SampG methods compared to other methods. Later, we used only the 3rd quartile to provide strong control of FDR. From Figure 5 and Table 1 we can see that the bootstrap, Qvalue and SPLOSH methods are very sensitive to correlation and have higher MSE, especially the SPLOSH methods tend to under-estimate higher π0 but over-estimate lower π0; the BUM, convest and PRE methods tend to under-estimate π0 in most of the simulated data sets with strong correlations, which is not favorable in FDR controls; the adaptive method tends to over-estimate π0 when the correlation is not very strong, which is also observed by  with independent and weak correlated data, but it worked better on cases with strong correlations when the MSE were the least among all tested methods; mgf method is generally the second best among the above, with relatively small bias and variation among all simulated cases. In terms of SampS and SampG methods, both the 2nd and 3rd quartile outperformed the corresponding plug-in estimator, due to smoothed variation by model averaging, and SampG performs better than SampS with smaller MSE because SampG breaks down correlation between genes whereas SampS does not. And since mgf outperforms the q-value method, the SampG and SampS with the mgf plug-in also outperforms SampG and SampS with the q-value plug-in.
P-values comparing FDR by different methods.
P-values comparing power by different methods.
Percentage in decrease of mean FDR
Percentage in decrease of mean power
Comparing the re-sampling strategy to bootstrap p-values directly
When there is strong correlation between genes, and thus also between p-values, bootstrapping p-values does not change the correlation structure and therefore the estimations are still unstable. In contrast, re-sampling of samples or gene expression values as we proposed could address the variations induced by the correlation structure and therefore smooth the estimation.
In actual microarray datasets, genes expression is often correlated due to co-regulation, sharing of transcription factor binding motifs, or technical reasons such as sequence similarity, cross-hybridization or signal leak during hybridization. This is of critical importance for statistical inferences that rely on pooling of test statistics across genes . The distribution of p-values of these correlated genes can be viewed under a mixture model where groups of highly correlated genes share similar p-values and the whole distribution is actually a mixture of components corresponding to the groups of highly correlated genes. The effect of strong and large scale correlations is equivalent to reducing the total number of independent components in this mixture model. Storey  argued that subsets of genes fall into small but highly correlated groups due to co-regulation or cross-hybridization, but these groups are small in size and nearly independent with each other ("clumpy dependency"), therefore the uniformity of p-value distribution of true null genes would not be strongly affected. However, other researchers, such as Qiu et al , have found that the impact of correlation may be quite strong. It is also true that in our permutation study on a breast cancer microarray dataset, the distribution of p-values could be extremely far from uniform due to gene-gene correlation.
Systems biology research has shown that biological gene networks have a scale free , hierarchical structure [37, 38], where most of the genes are connected to a small number of other genes forming small groups of complexes, while some "hub" genes may be connected to large number of peripheral genes. The distribution of connectivity degree (the number of genes being connected to a given gene) decreases with a power-law, which is much slower than the exponential decay expected in a random network . These gene-gene interactions may be reflected by co-regulation or correlation in expression under certain conditions, and the possible scale of gene interaction is unlimited given the scale free structure. Therefore, large scale correlation of gene expression levels is not surprising in microarray studies.
We have shown in our simulated study that with the growth of correlation structures, the p-value distribution of H0 genes increasingly deviates from a typical uniform distribution. This may influence the estimation of π0 and the following statistical inferences. The effect of strong correlation was also discussed by other researchers . To solve this problem, we proposed the SampS and SampG methods. These algorithms replaced the unbiased but unstable π0 estimation step in the q-value method with a model averaging procedure of re-sampling samples or furthermore re-sampling independently for each gene to partially break the correlations between genes. Strong correlation between genes will inevitably increase the variation of the π0 estimation, even though the variation could be partially smoothed by the re-sampling strategies proposed in this paper. Therefore, it is necessary to compromise between safety and efficiency; in this case, we would like to shrink the estimation toward 1 from an unbiased estimation to guarantee a strong control of FDR. That is why we used the upper quartile instead of the median of the re-sampling estimations, although medians had a smaller MSE in estimating π0 in our simulated studies. We showed in our simulations that these plug-in revisions, compared to the q-value method, can greatly reduce the variation of the π0 estimation under strong gene-gene correlations, and enhance the performance of FDR control by reducing false discovery rate up to 40% with a reduction of power less than 1% compared to q-value method.
In our study, to create datasets with a known proportion of true null hypotheses while still having a similar correlation structure to that in actual microarray datasets, we developed 2 strategies to generate simulated datasets. The first one, Data-B, is simply a block-wise structure with arbitrary block size and intra-block correlation, but independent between blocks. When the block size was small, this was similar to the "clumpy" hypothesis . When the block size became bigger, we showed that the correlations influenced the π0 estimation, and the re-sampling strategies we proposed improved the performance of gene selection by significantly reducing the FDR with a minor reduction of power. To mimic a more realistic scenario, we also developed the Data-M strategy to generate simulated data from actual microarray datasets by arbitrarily permuting a given proportion of the genes. This permutation breaks the correlation between arbitrarily assigned differentially and non-differentially expressed genes, but maintains the correlation within the 2 groups of genes.
The SampS and SampG methods we proposed and tested in this paper are simple revisions, but they greatly improved the π0 estimations. The same approach using independent re-samples of expression values to estimate the π0 and then using the upper quartile of the re-sampling estimations in FDR control, could be applied to other FDR algorithms in data where strong correlation between hypotheses exists. In this paper we considered the typical 2-sample comparison problem with a reasonable number of independent replicates. For more complex problems, such as time-course studies, the SampG method (re-sample gene expression values independently for each gene) may not be able to be applied directly without a reasonable number of replicates in each time point, but the SampS method (re-sample samples) may be applicable if there is sound reason to assume the existence of stationary time patterns in the biological system under investigation.
Availability and requirements
The R code of SampG and SampS methods, and R code to generate simulated data sets Data-B and Data-M, is included as Additional file 1.
Requirements: R environment.
Operating systems: Windows XP or Linux.
The authors want to thank Dr. Ping Ma, Charles Berry and Anthony Gamst for valuable advice and proofreading of the manuscript. The comments by an anonymous referee and the assistant editor are also gratefully acknowledged. The work was partially supported by NCI grant RO1 AI055056, ACS-IRG #70-002, and NCI Special Cancer Center Support Grant CA23100-22.
- Storey JD, Taylor JE, Siegmund D: Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J R Statist Soc B 2004, 66: 187–205.View ArticleGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J R Statist Soc B 1995, 57: 289–300.Google Scholar
- Benjamini Y, Yekutieli D: The Control of the False Discovery Rate in Multiple Testing Under Dependency. The Annals of Statistics 2001, 29: 1165–1188.View ArticleGoogle Scholar
- Efron B, Tibshirani R, Storey JD, Tusher V: Empirical Bayes Analysis of a Microarray Experiment. Journal of the American Statistical Association 2001, 96: 1151–1160.View ArticleGoogle Scholar
- Storey JD: A direct approach to false discovery rates. J R Statist Soc B 2002, 64: 479–498.View ArticleGoogle Scholar
- Storey JD, Tibshirani R: Statistical significance for genomewide studies. PNAS 2003, 100: 9440–9445.PubMed CentralView ArticlePubMedGoogle Scholar
- Scheid S, Spang R: A false discovery rate approach to separate the score distributions of induced and non-induced genes. In Proceedings of the 3rd International workshop on distributed statistical computating (DSC2003). Vienna, Austria; 2003:20–22.Google Scholar
- Scheid S, Spang R: A Stochastic Downhill Search Algorithm for Estimating the Local False Discovery Rate. IEEE Transactions on Computational Biology and Bioinformatics 2004, 1: 98–108.View ArticlePubMedGoogle Scholar
- Scheid S, Spang R: twilight; a Bioconductor package for estimating the local false discovery rate. Bioinformatics 2005, 21(12):2921–2922.View ArticlePubMedGoogle Scholar
- Aubert J, Bar-Hen A, Daudin JJ, Robin S: Determination of the differentially expressed genes in microarray experiments using local FDR. BMC Bioinformatics 2004, 5: 125.PubMed CentralView ArticlePubMedGoogle Scholar
- Allison DB, Gadbury GL, Heoc M, Fernandez JR, Lee CK, Prolla TA, Weindruch R: A mixture model approach for the analysis of microarray gene expression data. Computational Statistics & Data Analysis 2002, 39: 1–20.View ArticleGoogle Scholar
- Pounds S, Morris SW: Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics 2003, 19(10):1236–1242.View ArticlePubMedGoogle Scholar
- Liao JG, Lin Y, Selvanayagam ZE, Shih WJ: A mixture model for estimating the local false discovery rate in DNA microarray analysis. Bioinformatics 2004, 20(16):2694–2701.View ArticlePubMedGoogle Scholar
- Wu B, Guan Z, Zhao H: Parametric and nonparametric FDR estimation revisited. Biometrics 2006, 62(3):735–744.View ArticlePubMedGoogle Scholar
- Hsueh HM, Chen JJ, Kodell RL: Comparison of methods for estimating the number of true null hypotheses in multiplicity testing. J Biopharm Stat 2003, 13(4):675–689.View ArticlePubMedGoogle Scholar
- Pounds S, Cheng C: Improving false discovery rate estimation. Bioinformatics 2004, 20(11):1737–1745.View ArticlePubMedGoogle Scholar
- Langaas M, Lindqvist BH: Estimating the proportion of true null hypotheses, with application to DNA microarray data. J R Statist Soc B 2005, 67: 555–572.View ArticleGoogle Scholar
- Broberg P: A comparative review of estimates of the proportion unchanged genes and the false discovery rate. BMC Bioinformatics 2005, 6: 199.PubMed CentralView ArticlePubMedGoogle Scholar
- Efron B: Large-scale simultaneous hypothesis testing : the choice of a null hypothesis. Journal of the American Statistical Association 2004, 99: 96–104.View ArticleGoogle Scholar
- Efron B, Tibshirani R: Using specially designed exponential families for density estimation. The Annals of Statistics 1996, 24: 2431–2461.View ArticleGoogle Scholar
- Qiu X, Klebanov L, Yakovlev A: Correlation between gene expression levels and limitations of the empirical bayes methodology for finding differentially expressed genes. Stat Appl Genet Mol Biol 2005, 4: Article34.PubMedGoogle Scholar
- Storey JD: comment on “Resampling-based multiple testing for DNA microarray data analysis” by Ge, Dudoit and Speed. Test 2003, 12: 1–77.View ArticleGoogle Scholar
- Westfall PH, Young SS: Resampling-based Multiple Testing. In Wiley Series in Probability and Mathematical Statistics. Edited by: Barnett V, Bradley RA, Fisher NI, Hunter JS, Kadane JB, Kendall DG, Smith AFM, Stigler SM, Teugels JL, Watson GS. New York , John Wiley & Sons; 1993.Google Scholar
- Meinshausen N: False Discovery Control for Multiple Tests of Association Under General Dependence. Scandinavian Journal of Statistics 2006, 33: 227–237.View ArticleGoogle Scholar
- Efron B: Correlation and Large-Scale Simultaneous Significance Testing. Journal of the American Statistical Association 2007, 102: 93–103.View ArticleGoogle Scholar
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 2001, 98(9):5116–5121.PubMed CentralView ArticlePubMedGoogle Scholar
- Li C, Wong WH: DNA-Chip Analyzer (dChip). In The analysis of gene expression data: methods and software. Edited by: Parmigiani G, Garrett ES, Irizarry R, Zeger SL. New York , Springer; 2003:120–141.View ArticleGoogle Scholar
- Ge Y, Dudoit S, Speed TP: Resampling-based Multiple Testing for Micorarray Data Analysis. Test 2003, 12: 1–77.View ArticleGoogle Scholar
- Richardson AL, Wang ZC, De Nicolo A, Lu X, Brown M, Miron A, Liao X, Iglehart JD, Livingston DM, Ganesan S: X chromosomal abnormalities in basal-like human breast cancer. Cancer Cell 2006, 9(2):121–132.View ArticlePubMedGoogle Scholar
- Barash Y, Dehan E, Krupsky M, Franklin W, Geraci M, Friedman N, Kaminski N: Comparative analysis of algorithms for signal quantitation from oligonucleotide microarrays. Bioinformatics 2004, 20(6):839–846.View ArticlePubMedGoogle Scholar
- Wu Z, Irizarry RA: Preprocessing of oligonucleotide array data. Nat Biotechnol 2004, 22(6):656–8; author reply 658.View ArticlePubMedGoogle Scholar
- Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Zhu SX, Lonning PE, Borresen-Dale AL, Brown PO, Botstein D: Molecular portraits of human breast tumours. Nature 2000, 406(6797):747–752.View ArticlePubMedGoogle Scholar
- Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS, Thorsen T, Quist H, Matese JC, Brown PO, Botstein D, Eystein Lonning P, Borresen-Dale AL: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A 2001, 98(19):10869–10874.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang X, Lu X, Shi Q, Xu XQ, Leung HC, Harris LN, Iglehart JD, Miron A, Liu JS, Wong WH: Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data. BMC Bioinformatics 2006, 7: 197.PubMed CentralView ArticlePubMedGoogle Scholar
- Benjamini Y, Hochberg Y: On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioural Statistics 2000, 25: 60–83.View ArticleGoogle Scholar
- Barabasi AL, Albert R: Emergence of scaling in random networks. Science 1999, 286(5439):509–512.View ArticlePubMedGoogle Scholar
- Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nat Rev Genet 2004, 5(2):101–113.View ArticlePubMedGoogle Scholar
- Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical organization of modularity in metabolic networks. Science 2002, 297(5586):1551–1555.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.