- Methodology article
- Open Access
Identifying differential expression in multiple SAGE libraries: an overdispersed log-linear model approach
- Jun Lu^{1},
- John K Tomfohr^{1} and
- Thomas B Kepler^{1}Email author
https://doi.org/10.1186/1471-2105-6-165
© Lu et al; licensee BioMed Central Ltd. 2005
- Received: 04 March 2005
- Accepted: 29 June 2005
- Published: 29 June 2005
Abstract
Background
In testing for differential gene expression involving multiple serial analysis of gene expression (SAGE) libraries, it is critical to account for both between and within library variation. Several methods have been proposed, including the t test, t_{ w }test, and an overdispersed logistic regression approach. The merits of these tests, however, have not been fully evaluated. Questions still remain on whether further improvements can be made.
Results
In this article, we introduce an overdispersed log-linear model approach to analyzing SAGE; we evaluate and compare its performance with three other tests: the two-sample t test, t_{ w }test and another based on overdispersed logistic linear regression. Analysis of simulated and real datasets show that both the log-linear and logistic overdispersion methods generally perform better than the t and t_{ w }tests; the log-linear method is further found to have better performance than the logistic method, showing equal or higher statistical power over a range of parameter values and with different data distributions.
Conclusion
Overdispersed log-linear models provide an attractive and reliable framework for analyzing SAGE experiments involving multiple libraries. For convenience, the implementation of this method is available through a user-friendly web-interface available at http://www.cbcb.duke.edu/sage.
Keywords
- Library Size
- Deviance Test
- Iteratively Reweighted Little Square
- Overdispersion Parameter
- Satterthwaite Approximation
Background
Serial analysis of gene expression (SAGE) is used to measure relative abundances of messenger RNAs (mRNAs) for a large number of genes [1, 2]. Briefly, mRNAs are extracted from biological samples and reverse-transcribed to cDNAs. The double-stranded cDNAs are then digested by a 4-cutter restriction enzyme (anchoring enzymes, usually NlaIII). After digestion, another restriction enzyme (tagging enzymes) is used to release the downstream DNA sequences at 3' of most of the anchoring enzyme restriction sites. The released sequences, usually 10–11 base pairs (bp) long, are called SAGE tags. The tags derived from many different species of mRNAs can be concatenated, cloned and sequenced. In a typical SAGE experiment, a large number of tags (often ranging from 30,000 to 100,000) are collected from each sample, with each tag representing, ideally, one gene; the tag count indicates the transcription level of the gene represented by that specific tag. A natural question of interest is whether a given tag is differentially expressed. Over the past few years, SAGE has been extensively used for expression analysis of cancer samples for identifying diagnostic or therapeutic targets [3, 4].
Most SAGE studies focus on comparing expression levels between two samples. For such two-library comparisons, several statistical methods have been proposed, such as the simulation method of Zhang et al. [2], the Bayesian approaches [5–7], and the normal approximation based z-test [8] (which is equivalent to the chi-square test [9]). A comparative review by Ruijter et al. [10] has shown that all these methods perform equally well.
The comparison between two SAGE libraries can identify biologically interesting tags (or genes). However, in many cases it is essential to conduct experiments with replicates in order to account for normal background biological variation. For experiments involving multiple SAGE libraries, between-library variation beyond the binomial sampling variation is introduced. Such between-library variation can be due to additional known factors involved in the experimental design, as well as to unknown genetic or environmental variation between observations. Indeed, major differences in gene expression exist among SAGE libraries prepared from the same tissues of different individuals [11]. Statistical methods are needed for analyzing SAGE experiments involving multiple libraries. In the case of two-group comparisons (e.g. comparisons between a normal group and a cancer group), methods such as pooling the libraries in each group and transforming to two-library comparisons (for example, using the chi-square test), or the two-sample t- test on proportions have been proposed and discussed [12–14]. The pooling approach is often problematic since it ignores gene expression variation among libraries within the same treatment group, which leads to biased estimates for the variance. The two-sample t- test on proportions, however, can be problematic as well; proportions estimated from libraries with smaller sizes are known to be more variable than those from larger libraries.
For two-group comparisons, Baggerly et al. introduced a test statistic, t_{ w }, based on a hierarchical beta-binomial model to account for both between-library and within-library variation [13]. The t_{ w }test statistic is assumed to have an approximate t- distribution and like the t- test, the t_{ w }-test is only good for two-group comparisons. For SAGE experiments with a more general design (e.g. involving 2 or more factors), an approach based on overdispersed logistic regression has been described [15]. Overdispersed models aim to allow for the possibility of overdispersion in the tag counts, i.e., cases where the variance in tag counts exceeds what is expected for binomial or Poisson sampling alone. Besides its flexibility in modeling multiple factors and/or continuous covariates, logistic regression compares group proportions on a logit scale (log of odds) rather than a raw scale as in the t and t_{ w }tests. Comparing groups in logistic regression (and any generalized linear model) is equivalent to testing the hypotheses of whether the coefficients β = 0. Baggerly et al. [15] showed evidence suggesting that "the logit scale may be more appropriate" than the original proportion scale. One drawback with overdispersed logistic regression, however, is that it can break down for cases where all the tag counts in any of groups are very small. In such cases, the deviance test rather than the t-test (on the hypothesis that the coefficient β is zero) has been proposed [15]. Besides that a systematic evaluation of the deviance test is needed, a potential drawback with the deviance test is that it may require multiple rounds of model fitting if a model contains multiple factors or covariates. Furthermore, questions still remain on exactly when the deviance test should be used in preference to the t-test.
In this report we introduce an overdispersed log-linear model approach to analyzing SAGE which is closely related to overdispersed logistic regression but has a different mean-variance relationship assumption. We compare its performance in identifying differential expression with that of three other methods, including the t-test, t_{ w }test and overdispersed logistic regression. Analysis of simulated and real datasets show that both the log-linear and logistic overdispersion methods generally perform better than the t and t_{ w }tests. Based on simulated data, the log-linear method is found to have better performance than the logistic method, showing equal or higher statistical power over a range of parameter values and with different data distributions. The overdispersed log-linear method also appears to have better performance on the real SAGE data which we analyze; a number of cases are seen where a tag is identified by the log-linear approach and appears to be clearly differentially expressed, but which would not have been identified as significant using the logistic regression method. Overdispersed log-linear models also offer the same flexibility as logistic regression, allowing for modeling multiple factors and/or covariates. We conclude that the overdispersed log-linear models provide an attractive and reliable framework for analyzing SAGE experiments involving multiple libraries.
Results
Overdispersed log-linear models: a case study
Overdispersed log-linear models (see details in Methods) are very similar to overdispersed logistic models, but there are two major differences. First, overdispersed log-linear models work with logarithms of proportions (the log link) with logarithms of sample sizes m_{ i }as offsets. In contrast, overdisersped logistic models use the log of the odds (the logit link). Second, the assumption of an overdispersed log-linear model leads to derived weights used by iteratively reweighted least squares (IRLS) that depend on the means of the tag counts (i.e. the weights depend on both library sizes and tag proportions). The weights in overdispersed logistic regression, in contrast, are a function of library sizes only (see Methods).
Comparisons of t- and deviance tests in overdispersed logistic regression and log-linear models and a test based on a Bayesian model
Group 1^{a} | logistic regression | log-linear model | Bayesian model | ||||
---|---|---|---|---|---|---|---|
library 1 | library 2 | t-test^{c} | deviance test | t-test^{c} | deviance test | E ^{d} | |
1^{b} | 0 | 0 | 0.645 | 0.115 | 0.003 | 0.001 | 0.01 |
2 | 2 | 2 | 0.485 | 0.122 | 0.002 | 0.002 | 0.02 |
3 | 5 | 5 | 0.383 | 0.133 | 0.003 | 0.005 | 0.04 |
4 | 10 | 10 | 0.324 | 0.149 | 0.007 | 0.01 | 0.05 |
5 | 20 | 20 | 0.291 | 0.183 | 0.02 | 0.025 | 0.07 |
6 | 50 | 50 | 0.324 | 0.29 | 0.104 | 0.117 | 0.11 |
7 | 100 | 100 | 0.494 | 0.508 | 0.376 | 0.404 | 0.12 |
Simulation study
To systematically evaluate the performance of the various tests in the case of two-group comparisons, we performed a simulation study. The tests compared here are the t, t_{ w }, logit-t and log-t. For t and t_{ w }, the test is whether _{ p }A = _{ p }B, where _{ p }A and _{ p }B are the mean proportion in groups A and B respectively. The logit-t and log-t are t tests on the hypothesis of whether β = 0 in the overdispersed logistic regression and log-linear models respectively. We do not attempt to replace the t-test with the deviance test in the overdispersed logistic regression model since this requires making a possibly subjective decision on when to use one test in preference to the other.
A list of parameter values used in the simulations
Distribution | binomial (i.e. no overdispersion); beta-binomial; negative-binomial |
---|---|
overdispersion parameter (φ) | 8e-06, 2e-05, 4.3e-05 for beta-binomial; 0.17, 0.42, 0.95 for negative binomial |
number of samples in groups A and B | 5 in each group |
mean proportion in group A (p_{ A }) | 1, 5, 10, 20, 50, and 100 out of 50,000 |
ratio of mean proportions (p_{ B }/ p_{ A }) | 1, 2 and 4 |
A pancreatic cancer dataset
Library information on 5 cancer and 2 normal pancreas SAGE libraries
Cancer cell lines | Normal cells | ||||||
---|---|---|---|---|---|---|---|
Library | ASPC | PL45 | CAPAN1 | CAPAN2 | Panc-1 | HX | H126 |
Library size | 31,224 | 29,557 | 37,674 | 23,042 | 24,749 | 31,985 | 32,223 |
Unique tags | 10,622 | 11,121 | 14,815 | 10,157 | 10,293 | 12,392 | 12,360 |
Pair-wise comparisons of the four tests
t-test | t_{ w }-test | logit-t | |
---|---|---|---|
t_{ w }-test | 39(12)^{a} | - | |
logit-t | 42(17) | 66(29) | - |
log-t | 36(16) | 63(25) | 82(43) |
A set of genes identified as significantly differentially expressed (p < 0.05 and also among the list of top 100 genes) according to the log-t test but not by the logit-t test (p > 0.05)
Normal | Cancer | ||||||||
---|---|---|---|---|---|---|---|---|---|
Tag | P (log-t) | P (logit-t) | HX | H126 | ASPC | PL45 | CAPAN1 | CAPAN2 | Panc-1 |
AGCAGATCAG* | 0.003 | 0.088 | 16 | 9 | 272 | 152 | 138 | 135 | 384 |
TTGGTGAAGG | 0.003 | 0.069 | 6 | 0 | 90 | 267 | 194 | 187 | 238 |
CCCATCGTCC | 0.003 | 0.309 | 13 | 34 | 2047 | 1333 | 364 | 456 | 408 |
CCTCCAGCTA | 0.006 | 0.465 | 3 | 16 | 452 | 1766 | 292 | 265 | 364 |
ACTTTTTCAA | 0.008 | 0.096 | 25 | 43 | 413 | 379 | 226 | 200 | 65 |
CAAACCATCC* | 0.01 | 0.463 | 9 | 9 | 439 | 1235 | 154 | 143 | 133 |
TGCCCTCAGG | 0.011 | 0.219 | 16 | 6 | 80 | 196 | 276 | 339 | 4 |
GCTGTTGCGC* | 0.011 | 0.151 | 3 | 3 | 35 | 30 | 82 | 126 | 133 |
GACATCAAGT* | 0.013 | 0.554 | 0 | 0 | 183 | 548 | 85 | 126 | 20 |
TTCACTGTGA | 0.014 | 0.149 | 0 | 3 | 128 | 105 | 77 | 91 | 16 |
TTGGGGTTTC | 0.015 | 0.142 | 69 | 37 | 701 | 507 | 173 | 195 | 230 |
TGCCCTCAAA | 0.016 | 0.246 | 3 | 6 | 32 | 112 | 135 | 178 | 0 |
GGGGAAATCG | 0.017 | 0.066 | 100 | 71 | 339 | 423 | 119 | 291 | 226 |
A list of top 40 genes differentially expressed between pancreatic cancer and normal ductal epithelium
Tag | Description | P | HX | H126 | ASPC | PL45 | CAPAN1 | CAPAN2 | Panc-1 |
---|---|---|---|---|---|---|---|---|---|
Up-regulated in pancreatic cancer | |||||||||
CTTCCAGCTA | annexin A2 | 0.0011 | 19 | 25 | 128 | 217 | 143 | 148 | 170 |
AAAAAAAAAA | - | 0.0018 | 6 | 3 | 128 | 210 | 180 | 165 | 133 |
AGCAGATCAG | S100 calcium binding protein A10 (annexin II ligand, calpactin I, light polypeptide (p11)) | 0.0027 | 16 | 9 | 272 | 152 | 138 | 135 | 384 |
TTGGTGAAGG | thymosin, beta 4, X-linked | 0.003 | 6 | 0 | 90 | 267 | 194 | 187 | 238 |
CCCATCGTCC | motichondria gene | 0.0032 | 13 | 34 | 2047 | 1333 | 364 | 456 | 408 |
CCTCCAGCTA | keratin 8 | 0.0059 | 3 | 16 | 452 | 1766 | 292 | 265 | 364 |
GGAAAAAAAA | ATP synthase, H+ transporting, mitochondrial F1 complex, epsilon subunit | 0.0063 | 3 | 6 | 64 | 61 | 74 | 74 | 57 |
CCCCAGTTGC | calpain, small subunit 1 | 0.0066 | 22 | 22 | 64 | 88 | 77 | 61 | 113 |
AACTAAAAAA | ribosomal protein S27a | 0.0078 | 19 | 16 | 45 | 85 | 80 | 61 | 61 |
TTCAATAAAA | RPLP1, Ribosomal protein, large, P1 | 0.0079 | 9 | 25 | 147 | 179 | 135 | 104 | 40 |
GCAAAAAAAA | chromosome 21 open reading frame 97 | 0.0079 | 6 | 3 | 58 | 68 | 40 | 65 | 65 |
ACTTTTTCAA | motichondria gene | 0.0081 | 25 | 43 | 413 | 379 | 226 | 200 | 65 |
CAAACCATCC | KRT18, Keratin 18 | 0.0095 | 9 | 9 | 439 | 1235 | 154 | 143 | 133 |
GTGTGGGGGG | Junction plakoglobin | 0.0096 | 6 | 3 | 29 | 64 | 50 | 56 | 61 |
TGCCCTCAGG | LCN2, Lipocalin 2 (oncogene 24p3) | 0.0106 | 16 | 6 | 80 | 196 | 276 | 339 | 4 |
GCTGTTGCGC | - | 0.0108 | 3 | 3 | 35 | 30 | 82 | 126 | 133 |
AAGAAGATAG | ribosomal protein L23a | 0.0116 | 16 | 9 | 77 | 108 | 85 | 65 | 24 |
GAAAAAAAAA | SMAD, mothers against DPP homolog 3 (Drosophila) | 0.0118 | 6 | 0 | 74 | 47 | 40 | 56 | 44 |
ACCTGTATCC | IFITM3, interferon induced transmembrane protein 3 (1-8U) | 0.0123 | 13 | 3 | 26 | 81 | 64 | 82 | 53 |
CAACTTAGTT | myosin regulatory light chain MRLC2 | 0.0128 | 6 | 6 | 51 | 61 | 53 | 48 | 16 |
Down-regulated in pancreatic cancer | |||||||||
GACGACACGA | ribosomal protein S28 | 0.0001 | 428 | 388 | 109 | 122 | 90 | 117 | 154 |
GGACCACTGA | ribosomal protein L3 | 0.0002 | 310 | 270 | 102 | 105 | 101 | 104 | 61 |
GATCTCTTGG | S100 calcium binding protein A2 | 0.0002 | 188 | 174 | 3 | 10 | 8 | 4 | 0 |
AGCAGGAGCA | S100 calcium binding protein A16 | 0.0005 | 144 | 152 | 26 | 41 | 45 | 26 | 16 |
AGCTGTCCCC | capping protein (actin filament) muscle Z-line, beta | 0.0005 | 219 | 254 | 13 | 3 | 3 | 4 | 0 |
GACTGCGCGT | tumor necrosis factor receptor superfamily, member 12A | 0.0007 | 103 | 93 | 10 | 10 | 24 | 22 | 16 |
GTGGTGTGTG | congenital dyserythropoietic anemia, type I | 0.0011 | 59 | 87 | 10 | 10 | 8 | 13 | 8 |
TAGGCATTCA | - | 0.0012 | 119 | 115 | 0 | 0 | 0 | 0 | 0 |
TGAGTGGTCA | microtubule-associated protein 1 light chain 3 beta | 0.0017 | 66 | 53 | 0 | 7 | 5 | 13 | 8 |
GGCGGCTGCA | excision repair cross- complementing rodent repair deficiency, group 1 | 0.0017 | 66 | 53 | 6 | 7 | 3 | 4 | 0 |
AAGTTTGCCT | glutaredoxin (thioltransferase) | 0.0022 | 66 | 62 | 0 | 3 | 3 | 0 | 4 |
AGCTCTCCCT | Ribosomal protein L17 | 0.0023 | 335 | 357 | 77 | 145 | 82 | 143 | 125 |
CCGAAGTCGA | transcriptional regulating factor 1 | 0.0024 | 53 | 56 | 0 | 7 | 5 | 0 | 0 |
GCTGCTGCGC | - | 0.0024 | 228 | 320 | 0 | 0 | 0 | 0 | 4 |
TTGGGAGCAG | isoleucine-tRNA synthetase | 0.0031 | 72 | 43 | 10 | 10 | 19 | 4 | 8 |
TAAGGAGCTG | Ribosomal protein S26 | 0.0031 | 344 | 329 | 138 | 85 | 96 | 43 | 101 |
AACAGAAGCA | hypothetical protein FLJ25692 | 0.0031 | 75 | 59 | 13 | 24 | 24 | 9 | 16 |
CCTCCACCTA | peroxiredoxin 2 | 0.0031 | 56 | 43 | 16 | 10 | 3 | 9 | 4 |
TGTGAGTCAC | - | 0.0038 | 31 | 62 | 0 | 0 | 0 | 0 | 0 |
TCAGGGATCT | - | 0.0038 | 41 | 53 | 0 | 0 | 0 | 0 | 0 |
Discussion
In this report we introduced a log-linear model with overdispersion for testing differential gene expression in SAGE. This model is closely related to the overdispersed logistic model proposed by Baggerly et al. [15] but with a different mean-variance relationship assumption. The differences between two models can be seen clearly in the weight (used by IRLS) associated with each observation: assuming library sizes are reasonably close, the overdispersed log-linear model tends to assign higher weights to observations in the group with the smaller mean proportion; in contrast, approximately equal weights are assigned to all the observations in the overdispersed logistic model. Although for real SAGE data the true mean-variance relationship is unknown, it has been observed that "for the higher counts data, the between-library variability is the dominant part of the variation" [13]; this suggests that the magnitude of the overdispersion in the group with higher counts is probably larger than that in the group with low counts so that the assumptions of the overdispersed log-linear model is probably more appropriate for SAGE data.
From the simulation study we have shown that, besides their limitation to two-group comparison, both the t- and t_{ w }-tests, in general, are not as powerful as tests which allow for the possibility of overdispersion. We mention one specific problem that can arise with the t- and t_{ w }-tests if the number of samples in the data set is small. Note that the rank orders from the t- test and the t_{ w }test in Table 4 are based on test statistics instead of p-values. The rank orders based on p-values can be different from those based on test statistics if the residual degrees of freedom differ among tests. Both the t- test and the t_{ w }-test use the Satterthwaite approximation [25] for the number of degrees of freedom since the variances are assumed to be different in the two groups. An example of how this can be problematic is given by tag "AGCTGTCCCC", which has tag counts 70, 82 in the two normal samples, and 4, 1, 1, 1, 0 in the five cancer cell line samples. The differential expression is highly significant based on the logit-t (p-value 0.0003) and log-t (p-value 0.0005) tests. In contrast, if the t_{ w }-test with the Satterthwaite approximation to the degrees of freedom is used, this tag is barely significant at the 5% level (p-value 0.050). The reason is that, while the magnitude of the t_{ w }statistic for this tag is actually extremely high (|t_{ w }| = 12.01), the calculated degrees of freedom is only about 1 (which leads to low significance). The small value for the degrees of freedom arises here because the estimated variance in the cancer group is very small; the approximated degrees of freedom is then about equal to the sample size of the normal group minus 1 (here, 2-1 = 1). Cases like this occur frequently in this data set since the number of libraries (samples) in one group is very small. It is not uncommon to have small sample numbers with SAGE data.
The four methods compared in this study follow the frequentist approach of hypothesis testing, and can be broadly considered as examples of linear models. For two-group comparisons, Vencio et al. [26] introduced a Bayesian approach to rank tags by the Bayes Error Rate. We compared their approach with the methods based on linear models by looking at differences in gene rankings determined using the pancreatic dataset. Considering the top 100 genes identified by the different tests, the two overdispersed models show the best agreement with the Bayesian method (~70% in common); 63 genes (of the top 100) are identified by all three tests. We also evaluated the Bayesian method using the artificial data in Table 1; as the tag counts in group 1 are increased, the evidence for differential expression decreases (i.e. the Bayes Error Rate goes up), which follows the expected trend. Furthermore, if we recognize tags with p < 0.05 or E<0.1 as being significantly differentially expressed [26], the results from the Bayesian approach are more consistent with those from the log-linear model than from the logistic models (see Table 1). Since the evidence measures used are conceptually very different, to perform a direct comparison between "P-value"-based methods and the Bayesian approach is not straightforward. Our results, however, suggest that the Bayesian approach of Vencio is a competitive Bayesian alternative for analyzing SAGE data in the case of two-group comparisons.
Methods
Data
Suppose that there are a total of n SAGE libraries in an experiment. Let m_{ i }be the size (total tag counts) of library i (i = 1..n) and r_{ i }be the tag counts for a specific tag in that library.
Also, let x_{ i }be the associated vector of explanatory variables and β the vector of coefficients. The comparison of two groups of SAGE libraries is a special case where there is only one explanatory variable associated with each observation (i.e. one factor with 2 levels).
The two-sample t-test
The t- test proposed by Welch [25] was used to test whether the mean of the proportions in one group equals the mean of the other. The proportions are assumed to have unequal variances in the two groups and the degrees of freedom is calculated based on the Satterthwaite approximation as in the t_{ w }-test (see below).
The t_{ w }-test
Baggerly et al. [13] introduced a beta-binomial sampling model to account for the extra-binomial variation for a simple design in which the comparison is between two groups of SAGE libraries. This is a special case of a linear model that contains one explanatory variable. Briefly, unobserved random variables P_{ i }are introduced to account for the between-library variation. For a given group, P_{ i }is assumed to have a beta distribution (α, β) with mean and variance E(P_{ i }) = α/(α+β), and Var(P_{ i }) = α β / [(α+β)^{2} (α+β+1)]. Notice that this is a special case of the form Var(P_{ i }) = φ p_{ i }(1 - p_{ i }) as in the overdispersed logistic model, where φ = 1/(α+β+1). Next, the group proportion is estimated by a weighted linear combination of individual proportions within the group , where = r_{ i }/m_{ i }and w_{ i }are weights associated with each individual proportion. The unbiased variance estimator of is given as
To avoid having an estimated variance that is less than the binomial sampling variance, a lower-bound for the variance is also provided. All the parameters (i.e. α, β and w_{ i }) are obtained through an iterative procedure. The same estimation procedure is applied to data from the other group. For testing whether the proportion in one group (say group A) equals the proportion in the other group (group B), a t- like statistic t_{ w }is constructed, where
The t_{ w }statistic is assumed to have a t- distribution with the degrees of freedom (df) calculated from the Satterthwaite approximation:
where n_{ A }and n_{ B }are the number of SAGE libraries in the group A and B respectively. This test is called the t_{ w }-test here. The implementation of both the t- and t_{ w }-test can be found in [13].
Overdispersed logistic regression approach
Baggerly et al. [15] provided a thorough description on this approach and details can be found in [24]. Briefly, unobserved continuous random variables P_{ i }are introduced to account for the between-library variation, where the mean and variance of P_{ i }have the following forms: E(P_{ i }) = p_{ i }; Var(P_{ i }) = φ p_{ i }(1 - p_{ i }). Here φ is a nonnegative scale parameter. Conditional on P_{ i }= p_{ i }, the r_{ i }have a binomial distribution (m_{ i }, p_{ i }). The unconditional mean and variance of r_{ i }can be shown to be E(r_{ i }) = m_{ i }p_{ i }and Var(r_{ i }) = m_{ i }p_{ i }(1 - p_{ i }) [1+(m_{ i }-1) φ]. Notice that if φ is 0 (i.e. there is no between-library variation or overdispersion), the variance of r_{ i }is the usual binomial variance m_{ i }p_{ i }(1 - p_{ i }). The estimation of coefficients β is carried out by the iteratively reweighted least-squares (IRLS) procedure, where the weights w_{ i }are 1/ [1+(m_{ i }- 1) φ]. Note that the weights w_{ i }are equal if the library sizes m_{ i }are equal.
The parameter φ is estimated by equating the goodness of fit Pearson's chi-square statistic X^{2} to its approximate expected value, which is
where v_{ i }= m_{ i }p_{ i }(1 - p_{ i }), and d_{ i }is the variance of the linear predictor . An iterative procedure is introduced to estimate φ and β, where the estimates of φ (and accordingly, the weights w_{ i }) and β are updated at each step. Given the estimated coefficients, the testing hypothesis is whether one (or more if there are more than two groups) of the coefficients (β) is 0. For this, the t-test rather than the z-test is recommended due to the introduction of the overdispersion parameter into the model [15, 32].
The hypothesis test based on overdispersed logistic regression is called the logit-t test here. The implementation including source code can be found in [15]. We consider overdispersion models (logistic or log-linear) only if the Pearson's chi-square statistic from the usual logistic regression (or log-linear) fit (i.e. without overdispersion) is greater than or equal to its expected value, n-p.
Overdispersed log-linear models
This model is closely related to the overdispersed logistic regression model. One way to derive it is based on the gamma-Poisson hierarchical model assumption [16]. Assume that an unobserved random variables θ_{ i }is distributed according to
θ_{ i }~ Gamma(μ_{ i }, 1/φ),
where μ_{ i }= m_{ i }p_{ i }, φ >0, E(θ_{ i }) = μ_{ i }and Var(θ_{ i }) = . Given p_{ i }, the response variable r_{ i }is assumed to be distributed as
r_{ i }| p_{ i }~ Poisson(μ_{ i }).
The unconditional mean and variance of r_{ i }can be shown to be E(r_{ i }) = μ_{ i }= m_{ i }p_{ i }and Var(r_{ i }) = μ_{ i }(1+μ_{ i }φ). Notice that as φ decreases to 0, the variance of r_{ i }approaches the usual Poisson variance μ_{ i }(i.e. m_{ i }p_{ i }). The same mean-variance relationship can also be derived if we assume r_{ i }has a negative- binomial distribution [16]. The mean μ_{ i }of the response variable r_{ i }and the covariates x_{ i }are connected through the log link function,
log μ_{ i }= log(m_{ i }p_{ i }) = x_{ i }β.
As in the overdispersed logistic regression model, the estimates of the coefficients β are obtained by the iteratively reweighted least-squares procedure, where the weights w_{ i }are 1/(1+μ_{ i }φ) [33]. Note that, in contrast to the overdispersed logistic regression model where the weights only depend on library sizes m_{ i }, the weights in the log-linear model depend on μ_{ i }(i.e. both m_{ i }and p_{ i }).
The hypothesis test based on overdispersed log-linear models is called the log-t test here. The R [34] source code and a web-interface for implementing this approach are available [35].
Declarations
Acknowledgements
The authors would like to thank anonymous reviewers for several constructive comments. We thank Gregory Riggins for introducing us to SAGE. We gratefully acknowledge the financial support of the NIH through the Duke University Center for Translational Research (5 P30 AI051445-03) and through the Southeast Regional Center of Excellence in Biodefense and Emerging Infections (U54 AI057157-02); and of the NSF through a grant to our collaborator David Bird (NCSU; DBI 0077503) as well as the Duke Center for Bioinformatics and Computational Biology for support through a postdoctoral fellowship to JL.
Authors’ Affiliations
References
- Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression.[comment]. Science 1995, 270: 484–487.View ArticlePubMedGoogle Scholar
- Zhang L, Zhou W, Velculescu VE, Kern SE, Hruban RH, Hamilton SR, Vogelstein B, Kinzler KW: Gene expression profiles in normal and cancer cells. Science 1997, 276: 1268–1272. 10.1126/science.276.5316.1268View ArticlePubMedGoogle Scholar
- Riggins GJ, Strausberg RL: Genome and genetic resources from the Cancer Genome Anatomy Project. Human Molecular Genetics 2001, 10: 663–667. 10.1093/hmg/10.7.663View ArticlePubMedGoogle Scholar
- Porter D, Lahti-Domenici J, Keshaviah A, Bae YK, Argani P, Marks J, Richardson A, Cooper A, Strausberg R, Riggins GJ, Schnitt S, Gabrielson E, Gelman R, Polyak K: Molecular markers in ductal carcinoma in situ of the breast. Molecular Cancer Research: MCR 2003, 1: 362–375.PubMedGoogle Scholar
- Audic S, Claverie JM: The significance of digital gene expression profiles. Genome Research 1997, 7: 986–995.PubMedGoogle Scholar
- Chen H, Centola M, Altschul SF, Metzger H: Characterization of gene expression in resting and activated mast cells. Journal of Experimental Medicine 1998, 188: 1657–1668. [erratum appears in J Exp Med 1998 Dec 21;188(12):2387]. 10.1084/jem.188.9.1657PubMed CentralView ArticlePubMedGoogle Scholar
- Lal A, Lash AE, Altschul SF, Velculescu V, Zhang L, McLendon RE, Marra MA, Prange C, Morin PJ, Polyak K, Papadopoulos N, Vogelstein B, Kinzler KW, Strausberg RL, Riggins GJ: A public database for gene expression in human cancers. Cancer Research 1999, 59: 5403–5407.PubMedGoogle Scholar
- Kal AJ, van Zonneveld AJ, Benes V, van den Berg M, Koerkamp MG, Albermann K, Strack N, Ruijter JM, Richter A, Dujon B, Ansorge W, Tabak HF: Dynamics of gene expression revealed by comparison of serial analysis of gene expression transcript profiles from yeast grown on two different carbon sources. Molecular Biology of the Cell 1999, 10: 1859–1872.PubMed CentralView ArticlePubMedGoogle Scholar
- Man MZ, Wang X, Wang Y: POWER_SAGE: comparing statistical tests for SAGE experiments. Bioinformatics 2000, 16: 953–959. 10.1093/bioinformatics/16.11.953View ArticlePubMedGoogle Scholar
- Ruijter JM, Van Kampen AH, Baas F: Statistical evaluation of SAGE libraries: consequences for experimental design. Physiological Genomics 2002, 11: 37–44.View ArticlePubMedGoogle Scholar
- Blackshaw S, Kuo WP, Park PJ, Tsujikawa M, Gunnersen JM, Scott HS, Boon WM, Tan SS, Cepko CL: MicroSAGE is highly representative and reproducible but reveals major differences in gene expression among samples obtained from similar tissues. Genome Biology 2003, 4: R17. 10.1186/gb-2003-4-3-r17PubMed CentralView ArticlePubMedGoogle Scholar
- Ryu B, Jones J, Blades NJ, Parmigiani G, Hollingsworth MA, Hruban RH, Kern SE: Relationships and differentially expressed genes among pancreatic cancers examined by large-scale serial analysis of gene expression. Cancer Research 2002, 62: 819–826.PubMedGoogle Scholar
- Baggerly KA, Deng L, Morris JS, Aldaz CM: Differential expression in SAGE: accounting for normal between-library variation. Bioinformatics 2003, 19: 1477–1483. 10.1093/bioinformatics/btg173View ArticlePubMedGoogle Scholar
- Walter-Yohrling J, Cao X, Callahan M, Weber W, Morgenbesser S, Madden SL, Wang C, Teicher BA: Identification of genes expressed in malignant cells that promote invasion. Cancer Res 2003, 63: 8939–8947.PubMedGoogle Scholar
- Baggerly KA, Deng L, Morris JS, Aldaz CM: Overdispersed logistic regression for SAGE: modelling multiple groups and covariates. BMC Bioinformatics 2004, 5: 144. 10.1186/1471-2105-5-144PubMed CentralView ArticlePubMedGoogle Scholar
- Casella G, Berger RL: Statistical Inferences. 2nd edition. Pacific Grove, CA: DuXBURY; 2002.Google Scholar
- Gelman A, Carlin J, Stern H, Rubin D: Bayesian Data Analysis. Second edition. Boca Raton, Florida: CHAPMAN & HALL/CRC; 2004.Google Scholar
- Shapiro DE: The interpretation of diagnostic tests. Stat Methods Med Res 1999, 8: 113–134. 10.1191/096228099666928387View ArticlePubMedGoogle Scholar
- SAGE Genie:[http://cgap.nci.nih.gov/SAGE]
- Boon K, Osorio EC, Greenhut SF, Schaefer CF, Shoemaker J, Polyak K, Morin PJ, Buetow KH, Strausberg RL, De Souza SJ, Riggins GJ: An anatomy of normal and malignant gene expression.[comment]. Proceedings of the National Academy of Sciences of the United States of America 2002, 99: 11287–11292. 10.1073/pnas.152324199PubMed CentralView ArticlePubMedGoogle Scholar
- Vishwanatha JK, Chiang Y, Kumble KD, Hollingsworth MA, Pour PM: Enhanced expression of annexin II in human pancreatic carcinoma cells and primary pancreatic cancers. Carcinogenesis 1993, 14: 2575–2579.View ArticlePubMedGoogle Scholar
- Paciucci R, Berrozpe G, Tora M, Navarro E, Garcia de Herreros A, Real FX: Isolation of tissue-type plasminogen activator, cathepsin H, and non-specific cross-reacting antigen from SK-PC-1 pancreas cancer cells using subtractive hybridization. FEBS Lett 1996, 385: 72–76. 10.1016/0014-5793(96)00352-3View ArticlePubMedGoogle Scholar
- Agresti A: Categorical Data Analysis. 2nd edition. Hoboken, New Jersey: A John Wiley & Sons, Inc., Publication; 2002.View ArticleGoogle Scholar
- Williams DA: Extra-binomial variation in logistic linear models. Applied Statistics 1982, 31: 144–148.View ArticleGoogle Scholar
- Welch BL: The generalization of 'students' problem when several different population variances are involved. Biometrika 1947, 34: 28–35.PubMedGoogle Scholar
- Vencio RZ, Brentani H, Patrao DF, Pereira CA: Bayesian model accounting for within-class biological variability in Serial Analysis of Gene Expression (SAGE). BMC Bioinformatics 2004, 5: 119. 10.1186/1471-2105-5-119PubMed CentralView ArticlePubMedGoogle Scholar
- Manly KF, Nettleton D, Hwang JT: Genomics, prior probability, and statistical tests of multiple hypotheses. Genome Res 2004, 14: 997–1001. 10.1101/gr.2156804View ArticlePubMedGoogle Scholar
- Choe SE, Boutros M, Michelson AM, Church GM, Halfon MS: Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol 2005, 6: R16. 10.1186/gb-2005-6-2-r16PubMed CentralView ArticlePubMedGoogle Scholar
- Kepler TB, Crosby L, Morgan KT: Normalization and analysis of DNA microarray data by self-consistency and local regression. Genome Biol 2002, 3: RESEARCH0037. 10.1186/gb-2002-3-7-research0037PubMed CentralView ArticlePubMedGoogle Scholar
- Wright GW, Simon RM: A random variance model for detecting of differential gene expression in small microarray experiments. Bioinformatics 2003, 19: 2448–2455. 10.1093/bioinformatics/btg345View ArticlePubMedGoogle Scholar
- Cui X, Hwang JT, Qiu J, Blades NJ, Churchill GA: Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics 2005, 6: 59–75. 10.1093/biostatistics/kxh018View ArticlePubMedGoogle Scholar
- LIoyd CJ: Statistical Analysis of Categorical Data. New York, NY: JOhn Wiley & Sons; 1999.Google Scholar
- Breslow NE: Extra-Poisson variation in log-linear models. Applied Statistics 1984, 33: 38–44.View ArticleGoogle Scholar
- R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; 2003.Google Scholar
- The website of overdispersed log-linear models for SAGE:[http://dulci.biostat.duke.edu/sage]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Comments
View archived comments (1)