Volume 6 Supplement 2
Genome-wide estimation of gender differences in the gene expression of human livers: Statistical design and analysis
© Delongchamp et al; licensee BioMed Central Ltd. 2006
Published: 15 July 2005
Gender differences in gene expression were estimated in liver samples from 9 males and 9 females. The study tested 31,110 genes for a gender difference using a design that adjusted for sources of variation associated with cDNA arrays, normalization, hybridizations and processing conditions.
The genes were split into 2,800 that were clearly expressed (expressed genes) and 28,310 that had expression levels in the background range (not expressed genes). The distribution of p-values from the 'not expressed' group was consistent with no gender differences. The distribution of p-values from the 'expressed' group suggested that 8 % of these genes differed by gender, but the estimated fold-changes (expression in males / expression in females) were small. The largest observed fold-change was 1.55. The 95 % confidence bounds on the estimated fold-changes were less than 1.4 fold for 79.3 %, and few (1.1%) exceed 2-fold.
Observed gender differences in gene expression were small. When selecting genes with gender differences based upon their p-values, false discovery rates exceed 80 % for any set of genes, essentially making it impossible to identify any specific genes with a gender difference.
Liver toxicity is the most common adverse event associated with the introduction of a new drug despite extensive pre-clinical toxicity testing. The failure to predict this toxicity is attributed to differences among species in the metabolism and disposition of certain chemicals and drugs. This spawns an interest in in vitro tests that use human hepatocytes. The scarcity of primary human hepatocytes and the unsuitability of many human cell lines, which were derived from liver cancer cells, are a serious limitation for developing such tests. In addition, primary hepatocytes differentiate quickly in culture, restricting their use to short term studies . The National Center for Toxicological Research (NCTR) embarked upon a program to develop conditionally immortalized cell lines as potential in vitro models to evaluate the liver toxicity of new drugs [2, 3]. A further incentive for this approach is the potential to study mechanisms of liver toxicity from different genders and/or ethnic populations. As a start and proof of principle, a study was proposed to develop and characterize conditionally immortalized human primary hepatocyte cell lines from female and male donors.
It is desirable to quantify in some way that an immortalized cell line retains functions characteristic of primary hepatocytes. Since cDNA arrays can screen thousands of genes for expression differences, they are attractive as an evaluation tool. Once some immortalized cell lines are characterized with respect to primary hepatocytes, cDNA arrays could monitor the persistence of gender differences in expression across these immortalized cell lines, thereby showing retention of some functions without using primary hepatocytes. Toward these ends, this study scanned the genome for expression differences in the livers between 9 males and 9 females.
The use of cDNA arrays to assay for gender difference encounters two statistical problems that are not simple to deal with. First, gender comparisons relying upon array technologies are subject to biological and technical sources of variation . The experimental design needs to avoid confounding technical variation within treatments. This study modified an experimental design, which was previously employed for two-dye per array hybridizations , to 33P-labeled filter arrays that are stripped and reused. Second, there is a potential for excessive false positive rates because of the number of genes evaluated. For our purposes, a set of genes is needed for which the false positive rate is acceptably low. The false positive rates associated with potential sets were evaluated using recently developed post hoc methods based upon the empirical distribution of the observed p-values [6, 7].
Segments of human liver were obtained from Dr. Fred Kadlubar, Division of Molecular Epidemiology, NCTR (Jefferson, AR) and Dr. Steven Strom, the University of Pittsburgh (Pittsburgh, PA). This project was approved by the Research Involving Human Subjects Committee of the Food and Drug Administration (FDA). Nine pairs, a male paired with a female, were formed from available subjects. Pairs were processed concurrently to control variation from technical sources associated with sample preparation and measurement, and this was the major reason for pairing subjects. In forming pairs, we also attempted to match the age, race, and smoking/drinking habits of subjects as much as possible. However, this matching was not rigorous. The age of each subject was known (range: 25–58). For some subjects, information concerning their race (Caucasian and Hispanic) and their smoking and/or drinking habits was available although this information was not complete. In addition, several of these subjects are known to have died in a hospital where they were administered drugs in a failed attempt to stabilize their condition.
Total RNA was isolated from each liver sample using TRIzol (Life Technologies, Rockville, MD) according to the manufacturer's recommendations. Purified RNA was then treated with DNAse to remove residual DNA contamination. One tenth volume of 10X DNAse buffer (0.4 M Tris-HCl, pH 7.9; 0.1 M NaCl, 60 mM MgCl2, 1.0 mM CaCl2) was added to a reaction containing 50 to 100 μg of purified RNA and one unit RQ1 RNAse-free DNAse (Promega, Madison, WI). The reaction was then incubated for 15 min at 37°C, extracted once with an equal volume of phenol/chloroform, precipitated with ethanol and finally resuspended in RNAse-free dH2O. RNA yields were determined by spectrophotometric analysis. RNA integrity was confirmed by gel electrophoresis.
The filters used in this experiment were GF200, GF201, GF203, GF204, GF205 and GF206 (Invitrogen, Carlsbad, CA). Each filter was spotted with 5,184 expressed sequence tags (ESTs) representing human genes.
Filter Array Analysis
Pre-wetted filters were prehybridized at 42°C for 2 hr in 0.75 M NaCl, 0.17 M NaPO4 buffer (pH 7.0), 0.15 M Na4P2O7 • 10 H2O, 5X Denhardt's solution, 2.0% SDS, 100 μg/ml denatured salmon sperm DNA, 50% formamide and 5.0 μg human Cot-1 DNA. Five micrograms of total RNA were combined with 2.0 μl of 1.0 μg/ml oligo dT primer (LifeTechnologies) in a total volume of 10.0 μl and then incubated for 10 min at 70°C. The reaction was quick-chilled on ice for 2 min. The following components were added to the oligo dT-primed RNA; 6.0 μl of 5X first strand buffer (Clontech, Palo Alto, CA), 1.5 μl of 20 mM dNTPs (dGTP, dTTP, dCTP) (Invitrogen), 1.0 μl of 0.1 M DTT, 1.5 μl of PowerScript reverse transcriptase (RT) (Clontech) and 10 μl of α-33P dATP (>3000 Ci/mmol) (ICN, Irvine, CA). The reaction was incubated for 90 min at 42°C. Unincorporated nucleotides were removed by column purification using Bio-spin 6 columns (Biorad, Hercules, CA). Incorporation of label for all targets ranged ± 20% from the mean. The radiolabeled target was denatured by boiling for 3 min and added to 5 ml of prehybridization solution. The filters were hybridized with the denatured target for 18–20 hr at 42°C. After hybridization, the filters were washed twice in 2X SSC, 1% SDS at 68°C for 30 min and twice in 0.5X SSC, 0.5% SDS at 68°C for 30 min
The washed, hybridized filters were sealed in plastic sheet protectors and exposed on a Molecular Dynamics phosphor-screen (Amersham, Piscataway, NJ) for 72 hr. The screens were imaged with a Storm Phosphorimager (Amersham) at a resolution of 50 microns. The median pixel intensity for each spot was determined using ArrayVision software (Research Imagine, Ontario, Canada). Each filter was stripped after imaging as recommended by the manufacturer. Briefly, boiling 0.5% SDS was poured into a large glass dish. Hybridized filters were placed in the hot solution and agitated for one hour without additional heating. The filters were then imaged for four hr on a phosphor screen (Amersham Biosciences) at a resolution of 200 microns. Each filter was stripped five times.
Analysis of covariance model that was applied to the data for an interrogated spot. This table gives the typical degrees of freedom and the approximate expected mean squared errors for the sources of variation estimated under the experimental design. The reported F-ratios are the median of the 32,112 analyses. This median is the vertical red line in the box of Figure 1. The p-value is for the tabled F-ratio with numerator degrees of freedom as tabled and 34 denominator degrees of freedom.
Expected Mean Squared Error
σ2 + Q (Time, Median)
σ2 + Q (Time)
Analysis of Covariance
An analysis of covariance was fit to the log-intensity data for each spot. No background correction was applied. This model estimated the difference between the male subject's log-intensities and the female subject's log-intensities, i.e., this analysis produced an estimate of the sex difference in every block. These 9 estimates are adjusted for the factors in the experimental design and they are normalized by the median. In addition, a similarly adjusted and normalized average magnitude of the log-intensities was estimated.
Table 1 is a typical analysis of variance table for the statistical model that was fit to each spot. The median of all the spot intensities that were observed at each array-hybridization-time was first computed. These medians were entered as covariates to normalize the log-intensities  and their effect in the model is similar to a global normalization. The least squares estimate of the difference in log-intensities between the male sample and female sample was computed for each block. These estimates are normalized by the median and they are also adjusted for the main effects of time, block, array, and hybridization. An 'average adjusted log-intensity' was also computed for each spot. These estimates are the least squares means evaluated at 'median = 10' and the average levels of the categorical factors. This analysis was implemented using 'PROC GLM' .
Within each array type (5352 spots) and block, the estimates of the gender difference were plotted against their estimate of the average magnitude, e.g., Figure 2. Observed trends were removed by Loess regression using 'PROC LOESS' . This program allows one to specify several parameters that govern the degree of smoothing. Herein, we selected a quadratic equation, a bandwidth of 10% of the data (about 500 observations), and applied the smoothing algorithm three times to mitigate the influence of 'outliers'. Other combinations of these smoothing parameters were also tried, and they did yield differences in details, which we elaborated in the Discussion.
Selecting Expressed Genes
The adjusted average log-intensity for many of the interrogated genes evidenced little if any expression. Genes that are not expressed cannot be differentially expressed. So, it is useful to separate "expressed" genes from "not expressed" genes in analyses. Genes were partitioned into "expressed" and "not expressed" groups based on their adjusted average log-intensity. Essentially, high intensities are unlikely to have resulted from cross hybridization or other background sources, while low intensities are likely to represent a substantial amount of cross hybridization. The empirical distribution of the adjusted average log-intensity estimates was examined in a normal probability plot to determine a reasonable cut point. In liver samples that do not contain any mRNA matching a spotted cDNA sequence, i.e., a gene that is not expressed, the observed hybridization log-intensity is a background level arising from cross hybridization plus measurement error. When a large number of genes are not expressed in all of the liver samples, their intensities being of similar magnitude produce an obvious mode at the low end of the empirical distribution. Values less than this mode are assumed to arise entirely from "not expressed" genes. We also assumed that the distribution is symmetric about this mode and approximated this component of the empirical distribution by a normal distribution , which can be estimated directly from the normal probability plot. The genes were partitioned into "expressed" genes and "not expressed" genes based on a cutoff, which gives a low probability that larger values arise from the normal distribution. Genes with values greater than the cutoff were classified as "expressed" and the rest were classified as "not expressed".
Selecting Genes with Gender Differences in Expression
A few genes were spotted more than once. The study examined 32,112 spots representing 31,110 genes (distinct Gene Bank accession numbers). The analysis of covariance/Loess regression generated 9 smoothed estimates of log-fold changes for each spot. Estimates from replicated spots were averaged so that there was one estimate per block and gene. Likewise, replicated estimates of the adjusted average log-intensities were averaged. This resulted in 9 estimates of the gender difference and an estimate of the average magnitude for each interrogated gene.
There were 31,110 tests of the hypothesis that there was no gender effect. Simply selecting genes where the p-values are less that 0.05 would lead to an excessive number of false positives. Our strategy for dealing with the false-positive problem is elaborated elsewhere . P-values order genes according to the evidence for the null hypothesis. Genes having gender differences in expression are more likely to have small p-values and this is seen in a departure of their empirical distribution from its uniform expectation under the null [7, 12]. Herein, the observed distribution was assumed to be a mixture distribution with a proportion of the values having a uniform distribution, i.e., no gender difference in the expression, and the remainder having a Beta distribution, i.e., sexes differed. The mixing proportion and Beta parameters were estimated by maximizing the likelihood of this mixture distribution . The estimated mixture components were used to estimate false discovery rates for subsets of genes classified as 'having a gender difference' because their p-value is less than a specified value .
This study interrogated 32,112 cDNA spots using six types of arrays, each having 5352 spots. For each spot, the data are typically 72 observations, i.e. log-intensities from 9 blocks × 2 arrays × 2 hybridizations (hyb) × 2 development times. With no missing values, there would be 32,112 × 72 (2,312,064) observations. About 2 % of the data was discarded because the quality of the image from the phosphorimager was judged to be unsatisfactory, 48,168 observations: data from 6 of the 16 hr development times and 3 of the 72 hr development times. All block-array-hybridization combinations have data from at least one development time and all blocks have at least 7 observations out of the 8 that were planned for. Thus, gender differences were estimable for all blocks and genes.
Scatter plots of the accepted data suggest that there may be a few outlier values on some arrays. We did not attempt to remove 'outliers'. Most occurred in spots with background levels of expression and any apparent gender differences in these genes were classified as false positives because the gene was essentially 'not expressed' in this study. The few wider confidence bounds seen in Figure 6 may represent a contribution from an outlier.
Analyses of Covariance
In the analysis of covariance, least squares estimates of the logarithm (base 2) of the fold-change in gene expression, males/females, were computed for each block and spot. These are hereafter referred to as "estimated log-fold changes". Likewise, the least squares estimates of the expected log-intensity evaluated at median = 10 and the mean levels of the other factors, i.e., block, time, hyb, and array, were computed for each spot. We refer to these estimates as "adjusted average log-intensities".
Adjustment by Loess regression
The upper panel of Figure 2 plots estimated log-fold changes from block 5 and array type GF201 (5352 spots) against the adjusted average log-intensities. This figure is representative of trends observed over blocks and array types in the sense that the estimates exhibit systematic deviations from a horizontal line at 0 and these deviations tend to affect all spots within a neighborhood. The magnitude and direction of these deviations differ by block and array type. So, it is unlikely that these deviations represent gender differences. Such trends were removed by Loess regression computed within each block and array type, e.g., lower panel of Figure 2, yielding smoothed estimated log-fold changes.
Selecting expressed genes
Selecting genes with gender differences in expression
The average of the Loess-smoothed estimates (black) and their 95% confidence bounds (blue) are plotted for the 'expressed' genes in Figure 6. The horizontal axis in this plot is the rank of the average among the 2800 'expressed' genes. All of the observed gender differences in gene expression are small. Essentially all of the point estimates (99.7 %) are within the interval, [-0.5, 0.5]. That is, observed fold changes were less than . Furthermore, 79.3 % of the 95 % confidence intervals are within the interval, [-0.5, 0.5] and most (98.9 %) are within the interval, [-1,1]. Only 27 genes have confidence intervals that are not within the interval, [-1,1]. In concept, these 27 genes could have gender differences larger than 2 fold. However, wide bounds usually reflect an outlier observation, which inflates the standard deviation.
This study aimed to identify genes in human livers for which expression differed between the sexes. Expression was assayed for 31,110 genes of which 2,800 were classified as "expressed". The remainder of the interrogated genes had observed expression intensities that were indistinct from background, and they are classified as "not expressed". However, most of the genes for the cytochrome P450 enzymes were in this "not expressed" subset, which suggests that detection was limited by background and/or the sensitivity of the labeled target. Evidence of a gender difference was evaluated by a t-test, and a mixture model was fit to the observed p-values from these tests (Figure 4). This model estimated that expression in 8 % [95 % CI: 7 % to 9 %] of the 2,800 "expressed" genes differed by gender. However, the ability of this study to identify specific genes is poor with estimated false discovery rates exceeding 80 % for every partition of the 2,800 genes (Figure 5).
The study estimated the relative difference in expression for all interrogated genes. These estimates were plotted for the "expressed" genes (Figure 6). All estimates are less than 1.55 fold and they generally have narrow confidence intervals. Any gender differences that might be detectable through these arrays would be small, which excludes their anticipated use to monitor the persistence of gender differences across immortalized cell lines.
The experiment was designed to adjust estimated gender differences for several sources of variation (Table 1). Since this design made twice as many measurements – two arrays per sample, this yielded more precise estimates of gender differences than would have been possible with one array per liver sample. This design was proposed because we had access to a limited number of liver samples, and arrays were relatively inexpensive. Conceptually, one could have used the same number or arrays with twice as many samples. Figure 1 showed that components of variation associated with arrays and hybridizations are somewhat larger than the variation associated with subjects. These components do not impact the variation in the estimated gender difference under the implemented design. However, they would under a one array per sample design, and any precision gained in estimating the subject component would be offset by the addition of array and hybridization components. Because these two designs would estimate the respective variances of the gender difference with different degrees of freedom, the better of the two designs would depend on the number of blocks. In all but cases with a few blocks, the implemented design would be better.
Figure 2 illustrates a case where there was a systematic trend in the least squares estimates. There were substantial trends in about half of the block – array type combinations. However, the needed correction was not similar either within an array type or within a sample block. Consequently, estimated gender differences would be much more variable without the Loess smoothing. We suspect that these trends are caused by spatial variation in binding due to intrinsic properties of membranes and/or irregular distribution of labeled solution, besides it seems unlikely that real gender differences would exhibit this behavior. Smoothing eliminates much of this variation, but it also reduces the estimated difference. We tried several smoothing levels ranging from no smoothing to very local smoothing. Figure 2 represents an intermediate level of smoothing, which was adopted for this report as a representative example. The smoothing levels that we tried gave similar distributions for the p-values, and the same general conclusions would be reached. Namely, there is evidence of a few small changes but the specific genes cannot be identified. However, the rank of a specific gene in Figure 6 or the genes included in a 'significant' set (Figure 5) depends far too much on decisions about the smoothing parameters.
The estimate that 8 % of the "expressed" genes have gender differences relies on the assumption that the p-values from genes with no gender difference in expression have a uniform distribution. Since these p-values are based on a t-test, this assumption requires that the mean gender difference for each gene have a normal distribution. This is not unreasonable on theoretical grounds. Further, bootstrap estimates are largely within the expected range if they are considered as estimates of the t-test's p-values (data not shown). That is, the assumption of a symmetric distribution under the null hypothesis gives a distribution for the mean that is well approximated by the t-distribution.
More problematic is the assumption that the 2,800 tests are independent. Correlations are induced through shared conditions by all genes on an array, the normalization step, and the Loess regression step. Further, expression levels among some genes are expected to be correlated because they work in concert to achieve a specific cellular structure or function. Simulation studies have shown that the estimated number of affected genes is not biased by correlations among tests, but correlations increase the variance of this estimate substantially [13, 14]. The reported confidence bounds on the proportion of affected genes assume independence and likely underestimate the actual variation, possibly by a substantial amount. Consequently, the statistical significance of the 8 % estimate is not clear.
P-values were computed for all interrogated genes. Genes that are not expressed should not have intensities that depend on sex. The p-values for the "not expressed" subset show little departure from the diagonal (Figure 4), which is "as expected" if they arise under the null distribution. These genes also have correlations induced by shared conditions, normalization and Loess regression. Apparently, these correlations were insufficient to disrupt the uniform distribution for these p-values.
We estimated that the gene expression of 224 genes differed between sexes. The observed gender differences in expression were small. False discovery rates exceed 80 % for every set of genes selected by their p-values, essentially making it impossible to identify any specific genes with a gender difference.
This study was funded by the Food and Drug Administration's Office of Women's Health and the National Center for Toxicological Research. The normal human livers used in this study were obtained through the Liver Tissue Procurement and Distribution System, Pittsburgh, Pennsylvania which was funded by NIH Contract #N01-DK-9-2310. CV was supported by an Oak Ridge Institute of Science and Education (ORISE) fellowship at NCTR during the early development of this research.
- Guillouzo A: Liver cell models in in vitro toxicology. Environmental Health Perspectives 1998, 2: 511–532.View ArticleGoogle Scholar
- Kobayashi N, Noguchi H, Fujiwara T, Tanaka N: Establishment of a reversibly immortalized human hepatocyte cell line by using cre/loxp site-specific recombination. Transplantation Proceedings 2000, 32: 1121–1122. 10.1016/S0041-1345(00)01154-4View ArticlePubMedGoogle Scholar
- Westerman KA, Leboulch P: Reversible immortalization of mammalian cells mediated by retroviral transfer and site-specific recombination. Proceedings of the National Academy of Sciences of the United States of America 1996, 93: 8971–8976. 10.1073/pnas.93.17.8971PubMed CentralView ArticlePubMedGoogle Scholar
- Yang YH, Speed T: Design issues for cDNA microarray experiments. Nature Reviews Genetics 2002, 3: 579–588.PubMedGoogle Scholar
- Desai VG, Moland CL, Branham WS, Delongchamp RR, Fang H, Duffy PH, Peterson CA, Beggs ML, Fuscoe JC: Changes in expression level of genes as a function of time of day in the liver of rats. Mutation Research – Fundamental & Molecular Mechanisms of Mutagenesis 2004, 549: 115–129. 10.1016/j.mrfmmm.2003.11.016View ArticleGoogle Scholar
- Allison DB, Gadbury GL, Heo M, Fernandez JR, Lee C-K, Prolla TA, Weindruch R: A mixture model approach for the analysis of microarray gene expression data. Computational Statistics and Data Analysis 2002, 39: 1–20. 10.1016/S0167-9473(01)00046-9View ArticleGoogle Scholar
- Delongchamp RR, Bowyer JF, Chen J, Kodell RL: Multiple-testing strategy for analyzing cDNA array data on gene expression. Biometrics 2004, 60: 774–782. 10.1111/j.0006-341X.2004.00228.xView ArticlePubMedGoogle Scholar
- Coombes KR, Highsmith WE, Krogmann TA, Baggerly KA, Stivers DN, Abruzzo LV: Identifying and quantifying sources of variation in microarray data using high-density cDNA membrane arrays. Journal of Computational Biology 2002, 9: 655–670. 10.1089/106652702760277372View ArticlePubMedGoogle Scholar
- Parrish RS, Delongchamp RR: Normalization. In DNA microarrays and statistical genomic techniques: Design, analysis, and interpretation of experiments. Edited by: Allison DB, Page GP, Beasley TM, Edwards JW. New York: Marcel Dekker, Inc; 2005:in press.Google Scholar
- Sas/stat users guide, version 8. Cary, NC: SAS Institute Inc; 1999.
- Lee MLT, Kuo FC, Whitmore GA, Sklar J: Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations. Proceedings of the National Academy of Sciences of the United States of America 2000, 97: 9834–9839. 10.1073/pnas.97.18.9834PubMed CentralView ArticlePubMedGoogle Scholar
- Schweder T, Spjotvoll E: Plots of p-values to evaluate many tests simultaneously. Biometrika 1982, 69: 493–502. 10.2307/2335984View ArticleGoogle Scholar
- Tsai C-A, Hsueh H-m, Chen JJ: Estimation of false discovery rates in multiple testing: Application to gene microarray data. Biometrics 2003, 59: 1073–1083. 10.1111/j.0006-341X.2003.00123.xView ArticleGoogle Scholar
- Hsueh H-m, Chen JJ, Kodell RL: Comparison of methods for estimating the number of true hypotheses in multiplicity testing. Journal of Biopharmaceutical Statistics 2003, 13: 675–689. 10.1081/BIP-120024202View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.