Determining gene expression on a single pair of microarrays
© Reid and Fodor; licensee BioMed Central Ltd. 2008
Received: 07 February 2008
Accepted: 21 November 2008
Published: 21 November 2008
In microarray experiments the numbers of replicates are often limited due to factors such as cost, availability of sample or poor hybridization. There are currently few choices for the analysis of a pair of microarrays where N = 1 in each condition. In this paper, we demonstrate the effectiveness of a new algorithm called PINC (PINC is Not Cyber-T) that can analyze Affymetrix microarray experiments.
PINC treats each pair of probes within a probeset as an independent measure of gene expression using the Bayesian framework of the Cyber-T algorithm and then assigns a corrected p-value for each gene comparison.
The p-values generated by PINC accurately control False Discovery rate on Affymetrix control data sets, but are small enough that family-wise error rates (such as the Holm's step down method) can be used as a conservative alternative to false discovery rate with little loss of sensitivity on control data sets.
PINC outperforms previously published methods for determining differentially expressed genes when comparing Affymetrix microarrays with N = 1 in each condition. When applied to biological samples, PINC can be used to assess the degree of variability observed among biological replicates in addition to analyzing isolated pairs of microarrays.
Numerous strategies have been devised to come up with rigorous methods for detecting differentially expressed genes in sets of microarray data (for review see). The majority of microarray analytical methods require N = 3 in each condition in order to perform statistical measures. Due to the expense of microarrays, experimental imperfections such as poor hybridization and limited quantities of available biological sample source, it is not always possible to obtain the required sample sizes.
On an Affymetrix expression array, such as the HG-U133A GeneChip, each gene is represented on the array by a number of separate 25 mer probes that correspond to a part of the gene sequence. Many popular statistical methods including MAS5, RMA and GCRMA summarize probes into a single value for the entire probeset before performing statistical inference. In contrast, probe-level modeling has been used by the software packages affyPLM for quality-control purposes. There are also a number of statistical models that directly utilize probe information in statistical inference including Logit-T , Fisher's combined p-value , gMOS , and multi-mgMOS . Despite performing inference on probes, rather than on probesets, these methods still require multiple experiments (N ≥ 3) in each condition.
In this paper, we explore the idea that it should in principle be possible to use the high number of probes in each probe set to substitute for repeat experiments. That is, instead of using repeated chips to estimate the variance for statistical inference, can we exploit the existence of multiple probes per probe set to estimate the variance? Previously, Hein and Richardson have used a Bayesian hierarchical model to estimate expression levels from probe level data allowing for analysis with N = 1 in each condition. In their algorithm, called BGX, inference is performed at each stage of analysis (background correction, gene expression estimation and differential expression). The BGX algorithm outperformed existing probe level metrics such as the Wilcoxon test statistic and another Bayesian algorithm EBarrays. Because the BGX algorithm requiring a Markov chain Monte Carlo (MCMC) model for each stage of microarray analysis it is a very computationally demanding technique. As an alternative, we introduce an algorithm called PINC (PINC is not Cyber-T) based on the Cyber-T algorithm first described by Bali and Long and a method we recently described for generating accurate p-values. We show that PINC has attractive characteristics when compared to Cyber-T, BGX and other methods of performing inference on Affymetrix microarrays at low sample sizes.
Results and discussion
The Performance of Test Statistics in ranking genes on a control data set at N = 1
On the Affymetrix Latin Square HG-133A dataset, there are 11 probes per probeset. Given 11 independent measures in two samples, there are a variety of statistical tests available to evaluate the null hypothesis for each gene that the expression observed in each sample is identical. These include the Standard-T test, a paired t test (which is equivalent to a two way ANOVA in which the independent variables are probe and sample) and the Wilcoxon test (a non-parametric equivalent to a paired-T test). In addition to these canonical statistical tests, there are variants of the t-test specifically designed for microarrays. These include the paired and unpaired Cyber-T tests  in which the variance for each gene is an estimate based on an average of the canonical variance for that gene and a background variance of other genes with similar intensities on each array (see methods).
While the data in Figure 1A give a broad overview of how the algorithms perform, the scale of the x-axis does not represent a biologically useful signal. For example, at a false positive rate of 0.05 a gene list for the HG-133A microarray would have over 1,000 false positives. Clearly such a gene list is not that useful. To better explore a more biologically relevant cutoff, in which a gene list consists of mostly true positives, Figure 1B shows the same data as in Figure 1A, but with the x-axis scaled to show only gene lists that include a small number of false positives. Figure 1C shows the number of true positives captured at a cutoff of N = 4 false positives (Figure 1B dashed vertical line) for all 13 comparisons. At this more stringent cutoff the paired and unpaired Cyber-T tests clearly outperform the other statistics.
The Performance of Test Statistics in Providing Accurate p-Values for Inference
ROC curves rank all of the genes in an experiment but generating a gene list in a "real" experiment also requires choosing a cutoff point. That is, it is not enough to rank genes into an ordered list, one must know how many genes to consider significant from the list. Each test statistic generates a score for each gene and we wish to determine the threshold score above which genes are considered to be significantly differentially expressed. This has proven to be a challenging problem. In the microarray literature it is generally accepted that family-wise error rates, such as Bonferroni correction, are too conservative in an effort to prevent type-I errors thereby producing an abundance of type-II error [16, 17]. The use of false discovery rates (FDR) has become a popular alternative for controlling error rates (for a review, see ).
In this study, we evaluated the performance of different test statistics using the Benjamini and Hochberg (hereafter BH) and Benjamini and Yekutieli (hereafter BY) FDR cutoff levels (see methods), as well as the Holm's step down method, a more conservative family wise error rate correction algorithm ( and see methods). For the FDR algorithms, we set the cutoff level at 10%, i.e., we are willing to accept that 10% of the genes considered to be significant will be false positives. For the Holm's step down FWER, we set a cutoff level of 0.05 divided by N (22,223) for the highest scoring gene pair. Then for each subsequent gene, the cutoff is recalculated as 0.05 divided by the number of remaining genes.
We have previously shown that, when applied at the probeset level, p-values produced by canonical statistics and the unpaired Cyber-T test are not very accurate on control Affymetrix datasets . We proposed as a simple alternative, a method that assumes that all the background values on a microarray form a single distribution ( and see methods). We here propose a new algorithm PINC (PINC Is Not Cyber-T), which is the paired Cyber-T test performed at the probe level in which the p-values provided by the Cyber-T test are replaced with p-values generated by this assumption of a single background distribution. Applying the PINC algorithm yields a list in which the rank order is identical to the paired Cyber-T test (and therefore would have the same ROC profile in Figure 1) but the p-values differ. In Figure 2, we see that p-values generated by PINC do a better job of controlling FDR under both BH and BY FDR; the sensitivity of PINC in nearly as good as the sensitivity shown by Cyber-T paired, but the specificity is much closer to the expected level of 0.9. Indeed, no matter which of the three cutoff schemes we used to determine the threshold p-value of significance, the PINC algorithm nicely balanced sensitivity and specificity picking up a substantial fraction of true positives with a minimal number of false positives (Figure 2). All other algorithms perform poorly on either sensitivity and specificity suggesting that p-values calculated with these algorithms are either inappropriately large or inappropriately small. We conclude that when compared to other algorithms, the p-values produced by the PINC algorithm lead to inference that is less susceptible to bias introduced by the method of determining the threshold cutoff. That is, we argue that the p-values produced by PINC are more robust than p-values produced by the Cyber-T software or by canonical statistical tests.
Consistency in technical and biological replicates
Our results suggest that, at least on the technical replicates of the Latin Square experiment, the PINC statistic produces p-values that allow for correct inference in discriminating true and false positives. Because the p-values generated in 1 vs. 1 comparisons do not involve biological replicates, they cannot be used to evaluate biological variability; that is, they do not indicate the reliability of the observed difference in gene expression relative to biological noise across individuals. Rather, the p-values reflect the magnitude of the differences between the samples relative to technical variability that arise from hybridization noise, optical noise, differences in RNA degradation between the samples, artifacts that arise from probe selection and so forth. For the tightly controlled datasets such as the Latin Square dataset, the performance of the PINC algorithm at assigning p-values reflecting these sources of noise at N = 1 is clearly acceptable (Figure 2). However, what happens when we examine biological datasets in which biological noise, by necessity absent from the technical replicates that make up control datasets, makes up a significant component of the measured signal?
Figure 3 shows the results of these analyses of different sample sizes on the 13 Latin Square 2× comparisons. Figure 3A shows the number of true positives that can be recovered at an arbitrary cutoff of four false positives (similar to Figure 1C). Figure 3B shows the results of sensitivity and specificity after applying a BH-FDR cutoff of 10% (similar to Figure 2A). We see very similar results no matter if we use 1, 2 or 3 microarrays (conditions 1–3) or use a probeset analysis at N = 3 (condition 4). This confirms the observation of Klebanov and Yakovlev that noise derived from technical replicates is generally low and that PINC can yield results similar to a popular probeset algorithm such as Cyber-T despite the use of only one microarray.
We next applied PINC to a series of biological replicates with varying degrees of biological noise. We chose to analyze an Affymetrix dataset from a cell line study (Accession: NCBI Entrez Geo GDS756) that explored changes in gene expression of SW480, a primary colon cancer cell line  and an experiment extracted from human tissue with multiple human donors (Accession: GDS2191) that explores the regulation of the ubiquitin cycle in bipolar disorder . We reasoned that the biological noise in the human tissue dataset would be higher than the biological noise from the cell lines, while the cell lines would in turn have more noise than the technical replicates of the Latin Square experiment [see Additional file 1]. The experiments we chose all met the following criteria; the sample size needed to be at least N = 3, the datasets needed to be a control versus treatment type of design, the datasets needed to be based on the Affymetrix HG-U133A platform and the CEL files publicly available. Within each dataset with N>3, three microarrays for analysis were randomly chosen using a random number selection program http://www.random.org.
Experiments with few numbers of repeats are ineligible for analysis via most published microarray analytical methods. We have shown that when applying analysis at a probe level using PINC, we are able to generate reasonable results on control datasets at N = 1 in each condition. For paired single microarrays, PINC outperforms both canonical statistics and a recently published method while offering conceptually simple statistics and fast run-times. Because the p-values are derived from a distribution estimated from all of the genes on the array, PINC also avoids the large p-values usually associated with low sample size microarray experiments. This allows for the possibility of using a more conservative cut off criteria such as family wise error rate as an alternative to false discovery rate when selecting a p-value cutoff for selecting differentiated genes (Figure 2).
The success of the PINC algorithm in performing accurate inference on the Latin Square dataset at N = 1 suggests that there is little benefit to performing additional technical replicates. This is consistent with previous literature  as is our observation that one gets largely similar results whether one uses N = 1, N = 2 or N = 3 in ranking the 2× Latin Square experiments (Figure 3). The ability to analyze single Affymetrix experiments in a statistically rigorous way opens up the possibility of interesting analyses even for experiments in which multiple biological samples were collected. For example, in a cancer study in which cancer tissue is compared against non-cancer tissue from the same patient, we could generate gene lists consisting of genes that are differentially expressed at a given cutoff threshold for every patient in the study. This may yield very different insights than the usual practice of averaging the samples together and performing a single analysis to generate a single gene list. We know that diseases like cancer are very diverse with many different molecular mechanisms presenting similar clinical diagnostics. The ability to evaluate each patient individually in a statistically rigorous way may improve our understanding of the diverse causes of diseases such as cancer and may allow for better use of microarrays in personalized medicine.
where n is the sample size (the number of arrays in the condition), SD is the standard deviation as it is usually calculated, SDWindow is the average of the standard deviation of the 100 genes with the average intensity closest to the average intensity of the gene under consideration and Conf is an adjustable parameter set to 10 by default in the "v1.0beta" of the Cyber-T distribution for R http://cybert.microarray.ics.uci.edu. In a single chip treatment versus control experiment, n1 and n2 are equal to the number of probes for a particular gene and m1 and m2 are the averages of each group of probes.
For Affymetrix arrays, Cyber-T is typically used following summation of the probes into a single value for each probeset with an algorithm such as RMA (Examples can be seen in [24, 25]). As an alternative, PINC applies Cyber-T directly to probes within a probeset to determine gene expression scores. For a GeneChip such as the Affymetrix HG-U133A Array, each probe set contains 11 perfect match probes (we ignore mismatch probes). Thus for a single chip experiment (treatment versus control) PINC compares 11 probes in each position using the paired Cyber-T test (with N = 11).
The Cyber-T test generates a p-value for each gene evaluating the null hypothesis that the gene is identical in both conditions. Because the estimate for the variance of each gene is not independent but is instead dependent upon its neighboring gene scores, the authors of the Cyber-T do not expect the Cyber-T test to follow a simple t-distribution with n1+n2-2 degrees of freedom. Instead, the Cyber-T test assumes that Cyber-T scores will follow a t-distribution with 2 * Conf+n1 +n2 -2 degrees of freedom. We have previously shown that the p-values generated in this way are not very accurate.
To determine which genes are differentially expressed, PINC determines p-values by way of "Scheme 4" . Scheme 4 assumes that all the test statistic scores form a single normal distribution and then applies a "Statistical Level Normalization" step which corrects for systematic drift in the t-statistic away from a value of zero .
In summary, PINC takes the scores from the paired Cyber-T test at the probe level and uses "Scheme 4" to calculate the p-values rather than using the p-values reported by the Cyber-T software. In this paper, we refer to "Cyber-T" and "Cyber-T paired" as methods that act on the probe level but do not implement Scheme 4 to generate p-values. In our study, PINC is the only algorithm that has p-values generated by Scheme 4.
FDR and Family-Wise Error Rate algorithms
For the purposes of this analysis, we determined which genes were differentially expressed by either applying a 10% cut off rate via false discovery rates or performed multiple experiment correction via Holm's step down method  (p-value cutoff = 0.05).
The Benjamini and Hochberg algorithm (hereafter BH FDR) yields a predicted False Discovery Rate (FDR) for a given gene in a gene list ordered by statistic p-value:N*p(k)/k
Other statistical tests
At the probe level, we applied the student's Standard-T test (paired and unpaired), and Wilcoxon Rank Sum test. (The idea of applying the Wilcoxon test at the probe level to a single pair of microarrays was first suggested by Affymetrix). The BGX algorithm[10, 11] was also applied to the different datasets as a benchmark comparison.
For Cyber-T and BGX we used implementations in R using the Bioconductor package. All other statistical tests were implemented in Java (code available at http://www.afodor.net). Results for the Wilcoxon nonparametric test were generated from Java source code made publicly available by D. A. Nix http://rana.lbl.gov/~nix.
To assess the effectiveness of PINC, the HG-U133A Latin Square dataset was downloaded from Affymetrix . Two Probe sets with a number of probes other than 11 probes were discarded. For the Latin Square data sets, probesets 209374_s_at, 205397_x_at and 208010_s_at were excluded for all analyses as instructed by the HG-U133A_tag_Latin_Square.xls spreadsheet. We also excluded any probeset not in the spike-in probesets that started with AFFX-. This resulted in 42 true positives and 22,181 true negatives used for assessing effectiveness. The Affymetrix Latin Square dataset was analyzed using N = 1 for all 14 2× fold conditions taking the first experiment (i.e., the CEL file ending in R1) for each condition. CEL files were normalized as indicated using quantile normalization from dCHIP  or RMA Express http://rmaexpress.bmbolstad.com/[3, 28, 29] with background subtraction turned on (except for the BGX algorithm which performs its own normalization).
- Allison DB, Cui X, Page GP, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 2006, 7(1):55–65. 10.1038/nrg1749View ArticlePubMedGoogle Scholar
- Affymetrix statistical algorithms description document[http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf]
- Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185–193. 10.1093/bioinformatics/19.2.185View ArticlePubMedGoogle Scholar
- Wu Z, Irizarry R, Gentleman R, Murillo F, Spencer F: A Model-Based Background Adjustment for Oligonucleotide Expression Arrays. Journal of the American Statistical Association 2004, 99: 909–917. 10.1198/016214504000000683View ArticleGoogle Scholar
- Bolstad B: affyPLM: Fitting Probe Level Models. Bioconductor Vignettes. 2006.Google Scholar
- Lemon WJ, Liyanarachchi S, You M: A high performance test of differential gene expression for oligonucleotide arrays. Genome Biol 2003, 4(10):R67. 10.1186/gb-2003-4-10-r67PubMed CentralView ArticlePubMedGoogle Scholar
- Hess AHI: Fisher's combined p-value for detecting differentially expressed genes using Affymetrix expression arrays. BMC Genomics 2007, 8: 96. 10.1186/1471-2164-8-96PubMed CentralView ArticlePubMedGoogle Scholar
- Milo M, Fazeli A, Niranjan M, Lawrence ND: A probabilistic model for the extraction of expression levels from oligonucleotide arrays. Biochemical Society transactions 2003, 31(Pt 6):1510–1512.View ArticlePubMedGoogle Scholar
- Liu X, Milo M, Lawrence ND, Rattray M: A tractable probabilistic model for Affymetrix probe-level analysis across multiple chips. Bioinformatics 2005, 21(18):3637–3644. 10.1093/bioinformatics/bti583View ArticlePubMedGoogle Scholar
- Hein AM, Richardson S: A powerful method for detecting differentially expressed genes from GeneChip arrays that does not require replicates. BMC Bioinformatics 2006, 7: 353. 10.1186/1471-2105-7-353PubMed CentralView ArticlePubMedGoogle Scholar
- Turro E, Bochkina N, Hein AM, Richardson S: BGX: a Bioconductor package for the Bayesian integrated analysis of Affymetrix GeneChips. BMC Bioinformatics 2007, 8: 439. 10.1186/1471-2105-8-439PubMed CentralView ArticlePubMedGoogle Scholar
- Kendziorski C, Newton M, Lan H, Gould M: On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statistics in Medicine 2003, (22):3899–3914. 10.1002/sim.1548Google Scholar
- Baldi P, Long AD: A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001, 17(6):509–519. 10.1093/bioinformatics/17.6.509View ArticlePubMedGoogle Scholar
- Fodor AA, Tickle TL, Richardson C: Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biol 2007, 8(5):R69. 10.1186/gb-2007-8-5-r69PubMed CentralView ArticlePubMedGoogle Scholar
- Wilcoxon F: Individual Comparisons by Ranking Methods. Biometrics Bulletin 1945., 1(6):Google Scholar
- Cheng C, Pounds S: False discovery rate paradigms for statistical analyses of microarray gene expression data. Bioinformation 2007, 1(10):436–446.PubMed CentralView ArticlePubMedGoogle Scholar
- Pounds S: Estimation and control of multiple testing error rates for microarray studies. Briefings in Bioinformatics 2006, 7(1):25–36. 10.1093/bib/bbk002View ArticlePubMedGoogle Scholar
- Benjamini YHY: Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc B 1995, 57: 289–300.Google Scholar
- Benjamini YYD: The control of the false discovery rate in multiple testing under dependency. Ann Stat 2001, 29: 1165–1168. 10.1214/aos/1013699998View ArticleGoogle Scholar
- Holm S: A simple sequentially rejective multiple test procedure. Scand J Statist 1979, 6(65):70.Google Scholar
- Klebanov L, Yakovlev A: How high is the level of technical noise in microarray data? Biol Direct 2007, 2: 9. 10.1186/1745-6150-2-9PubMed CentralView ArticlePubMedGoogle Scholar
- Provenzani A, Fronza R, Loreni F, Pascale A, Amadio M, Quattrone A: Global alterations in mRNA polysomal recruitment in a cell model of colorectal cancer progression to metastasis. Carcinogenesis 2006, 27(7):1323–1333. 10.1093/carcin/bgi377View ArticlePubMedGoogle Scholar
- Ryan MM, Lockstone HE, Huffaker SJ, Wayland MT, Webster MJ, Bahn S: Gene expression analysis of bipolar disorder reveals downregulation of the ubiquitin cycle and alterations in synaptic genes. Molecular psychiatry 2006, 11(10):965–978. 10.1038/sj.mp.4001875View ArticlePubMedGoogle Scholar
- De Mees C, Laes JF, Bakker J, Smitz J, Hennuy B, Van Vooren P, Gabant P, Szpirer J, Szpirer C: Alpha-fetoprotein controls female fertility and prenatal development of the gonadotropin-releasing hormone pathway through an antiestrogenic action. Mol Cell Biol 2006, 26(5):2012–2018. 10.1128/MCB.26.5.2012-2018.2006PubMed CentralView ArticlePubMedGoogle Scholar
- Sommer P, Le Rouzic P, Gillingham H, Berry A, Kayahara M, Huynh T, White A, Ray DW: Glucocorticoid receptor overexpression exerts an antisurvival effect on human small cell lung cancer cells. Oncogene 2007, 26(50):7111–7121. 10.1038/sj.onc.1210524View ArticlePubMedGoogle Scholar
- Affymetrix Latin Square Data[http://www.affymetrix.com/support/technical/sample_data/datasets.affx]
- Li CWWI: DNA-chip analyzer (dChip). New York: Springer; 2003:1–455.Google Scholar
- Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31(4):e15. 10.1093/nar/gng015PubMed CentralView ArticlePubMedGoogle Scholar
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4(2):249–264. 10.1093/biostatistics/4.2.249View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.