 Research article
 Open Access
 Published:
Modeling Sage data with a truncated gammaPoisson model
BMC Bioinformatics volume 7, Article number: 157 (2006)
Abstract
Background
Serial Analysis of Gene Expressions (SAGE) produces gene expression measurements on a discrete scale, due to the finite number of molecules in the sample. This means that part of the variance in SAGE data should be understood as the sampling error in a binomial or Poisson distribution, whereas other variance sources, in particular biological variance, should be modeled using a continuous distribution function, i.e. a prior on the intensity of the Poisson distribution. One challenge is that such a model predicts a large number of genes with zero counts, which cannot be observed.
Results
We present a hierarchical Poisson model with a gamma prior and three different algorithms for estimating the parameters in the model. It turns out that the rate parameter in the gamma distribution can be estimated on the basis of a single SAGE library, whereas the estimate of the shape parameter becomes unstable. This means that the number of zero counts cannot be estimated reliably. When a bivariate model is applied to two SAGE libraries, however, the number of predicted zero counts becomes more stable and in approximate agreement with the number of transcripts observed across a large number of experiments. In all the libraries we analyzed there was a small population of very highly expressed tags, typically 1% of the tags, that could not be accounted for by the model. To handle those tags we chose to augment our model with a nonparametric component. We also show some results based on a lognormal distribution instead of the gamma distribution.
Conclusion
By modeling SAGE data with a hierarchical Poisson model it is possible to separate the sampling variance from the variance in gene expression. If expression levels are reported at the gene level rather than at the tag level, genes mapped to multiple tags must be kept separate, since their expression levels show a different statistical behavior. A lognormal prior provided a better fit to our data than the gamma prior, but except for a small subpopulation of tags with very high counts, the two priors are similar.
Background
In Serial Analysis of Gene Expression (SAGE), mRNA is extracted from a tissue sample and converted to cDNA, from which oligonucleotides (socalled SAGE tags) at specific locations in the cDNA fragments are extracted and amplified using PCR. Those tags are either ten or seventeen bases long, depending on the experimental protocol. Sequencing the PCR product, it is possible to establish the number of copies of each tag extracted. (For an elaborate description of the technology, see Velculescu [1]). Ideally, there would be a onetoone relation between tags and genes, so that the number of copies of a tag would be an indicator of the rate of transcription of the corresponding gene. Suppose the tissue sample contained n_{ t }copies of tag t each of which have a probability p of being extracted. The exact magnitude of p is unknown (and depends on experimental circumstances) but is certainly much smaller than 1 (Kuznetsov [2])), which suggests modeling the number y_{ t }of observed copies of tag t (the socalled SAGE count) as Poisson distributed with intensity λ_{ t }= pn_{ t }.
A Poisson model predicts a (large) number of zero counts, i.e. tags with positive lambdavalues that just happened not to be counted. Those cannot be distinguished from tags that do not exist at all or are never transcribed. The problem of estimating the total number of expressible tags (the size of the transcriptome) was studied by Stern [3], who found the number of tags expressed at each level to be inversely proportional to the square of the expression level. Stern concluded that the size of the transcriptome could not be reliably estimated from SAGE data. Part of the problem is that a substantial part of the lowexpressed tags may be artifactual, which is difficult to incorporate in the model. (Some authors have developed statistical models for SAGE data that take artifactually low counts into account, see Blades [4], Beissbarth [5] and Anisomov [6]). Kuznetsov [7], [8] modeled the SAGE data using a discrete Paretolike distribution and found that his model was able to predict the number of transcripts expressed at a level of ≥ 1 copy per cell. Although this was a major breakthrough, the discrete Paretolike distribution models the counts directly, which means that sample variance is not explicitly separated from the variance in gene expression. The model that we explore in this paper is an hierarchical Poisson model, i.e. a Poisson distribution with some prior distribution f of Poisson parameter λ
Y_{ t }~ Poisson(λ_{ t }), λ_{ t }~ f(·,θ) (1)
where Y_{ t }is the observed count for tag t, λ_{ t }is the "true" expression level of tag t and θ is some parameter in the model. For the prior f we tried a number of candidates (gamma, mixture of two gamma's, lognormal, Pareto). The gamma prior turned out to provide a good fit to the distributions of the tag counts for counts lower than a certain threshold, typically the 98th or 99th percentile. Attempts to model the tag counts above that threshold with a second gammadistributed component failed, not surprisingly since the number of tags in that range was too small to support meaningful estimates.
For the purpose of this paper we choose the gamma distribution, whose parameters were estimated with an empirical Bayesian approach [9]. The choice of the Gamma distribution was motivated by mathematical convenience only. The Gamma distribution is the conjugate prior of the Poisson distribution, i.e. if the parameters α and β are known, the posterior distribution of λ_{ t }given y_{ t }is distributed as Gamma(α + y_{ t }, β + 1) This is convenient because the posterior distribution of γ represents our knowledge of the true gene expression after the SAGE count has been observed. Also, since 1/β is a scale parameter in the Gamma distribution, libraries of different size can be compared. Other things being equal, we expect the estimated value of β to be inversely proportional to the library size.
The marginal distribution of Y becomes a negative binomial distribution:
$P(Y=y)=\frac{\Gamma (y+\alpha )}{\Gamma (\alpha )y!}\frac{{\beta}^{\alpha}}{{(\beta +1)}^{y+\alpha}}\text{}\left(2\right)$
and in particular
$P(Y=0)={\left(\frac{\beta}{\beta +1}\right)}^{\alpha}\text{}\left(3\right)$
Since the zero counts are not recorded, the counts of the recorded tags follow a zerotruncated negative binomial distribution
$P(Y=yY>0)=\frac{\Gamma (y+\alpha )}{\Gamma (\alpha )y!}\frac{{\beta}^{\alpha}}{{(\beta +1)}^{y+\alpha}}\frac{1}{1{\left(\frac{\beta}{\beta +1}\right)}^{\alpha}}\text{}\left(4\right)$
The zerotruncated negative binomial distribution has been studied by several authors, mainly for modeling group sizes. See Johnson [10] for an overview. Schenzle [11] studied the efficiency of various estimation methods for the parameters and reached the conclusion that for α < 1 heuristic methods do not work and ML estimation should be used instead.
If such a model can provide a good fit to the data, it will be useful for several purposes. For example, knowing the posterior distribution of λ_{ t }given Y_{ t }, one can construct a variancestabilizing transform for a SAGE library. Also, in order to assess the sensitivity of the SAGE technology with respect to genes with low expression levels, one needs to know the distribution of λ. The idea of applying Poisson models to SAGE data is not new. Cai [12] found a Poissonbased gene clustering algorithm to work better than one based on Euclidian distances. And several authors (Vencio [13], Ruijter [14]) have compared Poissonbased tests for differentiation in gene expression between libraries. However, those tests always look at a single SAGE tag at a time and tests for difference in the parameter λ across the libraries. As noted by Baggerly [15], it is possible that a model that incorporates all tags simultaneously, analogous to the models applied to microarray data, would be more powerful. This motivates our attempt to model the behavior of λ across genes. In this paper, we present a truncated GammaPoisson model augmented with a nonparametric component, as well as a bivariate truncated GammaPoisson model that can be used to comparing two libraries, and we show how the parameters can be estimated in those models.
Results
Untruncated gammaPoisson, ML and method of moments
In the human transcriptome map, genes with zero counts were reported, which allowed for the untruncated negative binomial model to be used. While the maximum likelihood (Johnson [10]) method worked quite well with simulated data, it did not provide meaningful results with the data from the Human Transcriptome Map, presumably because of the failure to account for a small population of tags with extremely high counts (typically 1% of all tags). Because those tags had high influence on the likelihood function, it was necessary to take them into account. Therefore, we assumed the expressed tags to be a mixture of two populations: a large population of tags with expression levels below some threshold k, described by the gammapoisson model, and a small population of tags with expressions at or above k, described by a nonparametric distribution. Because the maximum likelihood estimator itself requires iteration, we decided to use a heuristic to choose the threshold k and the startguess for the maximum likelihood estimator (See the Methods section).
As for the Human Transcriptome Map, the model provided a good fit to the counts for tags of low to moderate expression, but for all 72 libraries it was necessary to assume a nonparametric component accounting for the extremely strongly expressed genes. Figure 1 shows the expected frequencies versus the observed frequencies for HTM library N225.111k (a neuroblastoma cell line), which was the largest library.
The quartiles of the estimated parameters for the 72 libraries are shown in table 1.
We expected the estimated parameter β to scale with the inverse of the library size, and indeed we found a clear negative correlation between log(β) and log(library size), R = 0.78.
Using the estimates from N225.111k (α = 0.146, β = 0.171) as an example, variance of λ is α/β^{2} = 1.55.
The average sampling variance is identical to the mean, α/β = 0.27. Those average variances, however, hide huge differences in information content between tags with different expression levels: The coefficient of variation of the posterior λ_{ t }given a count y_{ t }is $1/\sqrt{{y}_{t}+\alpha}$
Truncated gammaPoisson, Maximum Likelihood and MCMC
The loglikelihood in the truncated negative binomial model, without the nonparametric component for the highexpressed tags, is
$\sum _{t}\left(\alpha log(\beta )+{\displaystyle \sum _{j=0}^{{Y}_{t}1}log(\alpha +j)(\alpha +{Y}_{t})log(\beta +1)log\left(1{\left(\frac{\beta}{1+\beta}\right)}^{\alpha}\right)}\right)}\text{}\left(5\right)$
We augmented this model with a nonparametric component for the highexpressed genes in the same way as the untruncated model described above.
While the estimate of the rate parameter in the truncated model became reasonably stable, and, as expected, approximately inversely proportional to the library size, the shape parameter became unstable, in particular with real data but also with simulated data.
The rate parameter (α) in most libraries converged towards the lower bound specified in the call to the maximization routine (0.001). A curious exception was GSM1130, a small library with only 17004 tags, which gave an estimate of α of 0.77 in the censored model as opposed to 0.05 in the uncensored model. For all the other libraries, the estimate of β for the censored model was in agreement with that from the uncensored model (mean ratio = 1.06, SD= 0.10). Computations with simulated data did not show a bias on the estimate of α, but still the estimated value of α was very unstable. Presumably, this difference between real and simulated data is related to the failure of the model to account for the extreme highexpressed tags, which were modeled with a nonparametric component.
When the parameters in the same model were estimated using MCMC, the rate parameter β was stable and in agreement with the results from the other methods, but the shape parameter (α) was, again, unstable. The estimated parameters for the HTM library N225.111k (a neuroblastoma cell line) are shown in table 2.
Truncated lognormalPoisson, Maximum Likelihood
Instead of incorporating a nonparametric component for the highexpressed genes, one may look for a different prior that has a thicker tail. For that purpose, we used a lognormal prior. From a biological point of view, the lognormal distribution is an attractive model for the true gene expression, because it is conventionally used for analysis of microarray data. The disadvantage of that model, however, is that the marginal distribution of the SAGE counts cannot be written on closed form. Therefore, the likelihood must be evaluated with numerical integration.
A simpler method would be to use the method of moments. Since the the probability of a zero count cannot be expressed on closed form, however, numerical integration is still required, though only for the zero counts. Unfortunately, experiments with simulated data showed that the parameters estimated by the method of moments were severely biased (estimated value of σ^{2} typically twice the true value). The MLalgorithm converged in 71 of the 72 libraries, the exception being GSM1130, the library that also gave contradicting βestimates for the truncated and untruncated gammaPoisson model. Figure 1 shows the frequency plot predicted by the truncated lognormal Poisson model, compared to the empiric frequency plot. Notice that while the model provides a good prediction of the number of strongly expressed tags, it gives a much lower estimate of the number of zero counts than recorded in the data set. That picture is typical.
Bivariate truncated gammaPoisson, Maximum Likelihood, HTM data
The fact that the shape parameter in the truncated model could not be estimated on the basis on a single SAGE library is related to the problem of estimating the number of zero counts. In order to solve this problem we fitted the parameters in a bivariate model applied to two libraries. When two libraries are compared, the tags that have positive counts in one library and zero count in the other are reported. If the correlation between the two libraries could be estimated, the number of tags that had zero counts in both libraries could be estimated as well, and we would get a stable estimate of the shape parameter. See the Methods section for details.
We fitted the bivariate model to all pairs from the 18 largest libraries. Of those 153 library pairs, the likelihood maximization algorithm converged in 151 cases. In 127 of those, the fourparametermodel, allowing α to be different for the two libraries, provided a significantly better fit (twice loglikelihood ratio higher than the 95percentile for the chisquare distribution with one degree of freedom). This is not surprising, given that there was considerable dispersion across libraries of the estimated α from the untruncated model. With this model, the median of our estimates of the transcriptome size for the subset selected in the HTM was median = 8030 with a quartile range from 6368 to 13767, compared to the 8100 reported in the HTM.
Figures 2, 3 and 4 show how the model compares to the observed data when HTM library IDC3 (Breast Tumor) is compared to HTM library 145 (Normal brain). Figure 2 shows a frequency plot for IDC3 for tags that were not observed in libray 145, figure 3 is for tags that had count = 1 in library 145 and figure 4 for tags that had count = 5 in library 145. Notice that the model accurately predicts the number of tags with zero counts (figure 3 and 4), given a positive count in the other library.
Since the bivariate model specifies the probability of (Y_{1}, Y_{2}) = (y_{1}, y_{2}) we were able to quantify the correlation between two libraries with the correlation coefficient. As seen in figure 5, a low correlation implies that the two libraries are based on different tissue types, while the converse is not necessarily true.
Bivariate truncated gammaPoisson, Maximum Likelihood, raw SAGE libraries
As expected, the bivariate model underestimated the number of tags with a count of one, presumably due to sequencing errors. Figure 6 shows a frequency plot for tag counts in SAGEgenie library 1003, conditioned on a count of one in library 430, compared to the predictions from two different versions of the bivariate gammapoissonmodel: the first estimate is based on the assumption that the model fits the data over the entire range, the second is based on the assumption that tags with count one in one library and zero in the other are unreliable and therefore have to be ignored (technically, a nonparametric component accounting for the tags with (1,0) or (0,1) count is assumed). As seen in figure 6, the second model fits much better. The number of tags with a count of one, assigned to the nonparametric component, was 31500 for library 430 and 32100 for library 1003, corresponding to sequencing error rates of 1.0% and 0.7%. This is similar to Beissbart's estimates (between 0.5% and 1.5%).
Discussion
We have demonstrated that the Poisson distribution with a conjugate gamma prior provides a good fit to real data with the exception of a small population (typically about 1%) of SAGE tags. However, the univariate model does not provide a stable estimate of the shape parameter. For the interpretation of the SAGE counts, this is not so terrible, since the posterior distribution of the true tag expression (λ_{ t }) given count Y_{ t }is gamma(Y_{ t }+ α, β + 1). With a value of α typically between 0 and 0.5, the posterior for distribution of λ given a positive count becomes insensitive to α. However, the posterior distribution of λ given a zero count requires a reliable estimate of α.
For the untruncated model, we assumed the total number of expressible tags (in the subset under study) to be 8100 as recorded in the Human Transcriptome Map. One could ask the question whether this set is (roughly) complete or whether there is a significant number of extreme lowexpressed tags which have not been recorded in any of the libraries. The fact that the number of expressible tags, as estimated by the bivariate model, had a median of 8030 suggest that the number of tags recorded in the Human Transcriptome Map is of the correct order of magnitude.
The small population of strongly expressed tags has been noted by other authors([16], [4]) before. It is possible that it has a biological or technological interpretation. However, Kuznetsov [7] showed that a discrete Paretolike distribution accurately predicts the number of highexpressed genes, which suggests that it is a modeling issue rather than a separate group of genes. As seen in figure 1, a Poisson distribution with a lognormal prior, which is also a biologically appealing model, may also be able to predict the number of highexpressed genes. Since the majority of the genes are expressed at such low levels that the difference between the gamma prior and the lognormal prior is small, we decided to base the bivariate model on the gamma prior which is mathematically more convenient.
As a consequence of this choice, the model does not provide a posterior distribution of λ given a count above the threshold. Fortunately, this is not so critical because for a high count the posterior mean will be close to the observed count.
Another issue relates to genes mapped to multiple tags. It is reasonable to assume some correlation between two tags representing the same gene. In the human transcriptome map, the counts for those genes were reported as the sum of the tag counts. Suppose a gene is represented by two tags, the count of both being Poisson distributed with intensity λ ~ Gamma(α, β) and correlation coefficient R. If the total count for the gene is Poisson distributed with intensity λ* ~ Gamma(α*, β*), we have
$E(\lambda *)=2E(\lambda )\Rightarrow \frac{\alpha *}{\beta *}=2\frac{\alpha}{\beta}\text{}\left(6\right)$
$VAR(\lambda *)=(2+R)VAR(\lambda )\Rightarrow \frac{\alpha *}{\beta *2}=(2+R)\frac{\alpha}{{\beta}^{2}}\text{}\left(7\right)$
and thus
$\beta *=\frac{\beta}{1+R}\text{}\left(8\right)$
$\alpha *=\frac{2\alpha}{1+R}\text{}\left(9\right)$
This shows that if the counts per gene, rather than the counts per tag, are reported in a data set, genes with different numbers of representing tags should be kept separate. As shown in figure 7, the estimated values of α for genes mapped to two or three tags are proportional to the estimated values of α for the genes mapped to a single tag. This suggests that the tag counts have the same distribution, whether they share the gene with other tags or not. The proportionality constant of 2.2 for genes mapped to two tags and 3.5 for genes mapped to three tags correspond to a correlation coefficient of approximately 0.05. This is a surprising result, since we found positive correlation between tags mapping to the same gene in the data used by Cai [12]. A possible explanation for the negative correlation is that different splicing variants compete for the same transcription product.
By modeling two SAGE libraries with a bivariate truncated negative binomial model, it was possible to achieve a more stable estimate of α. More important, the bivariate model has a useful interpretation: the shared gamma process Z is the main effect (gene effect) while the independent gamma process X is the (gene, library) interaction effect. A further generalization of the bivariate model will be to incorporate multiple interaction effects in a multivariate model, for example a third gamma process related to treatment groups.
We have made the distinction between truncated models (only positive tags considered) untruncated model (the set of expressible tags assumed to be known). It could be argued, however, that even when zero counts are reported (such as in multilibrary data sets including all tags with a positive count in at least one library), the untruncated models should allow for a an unknown number of nonrecorded tags. Such models are called zerodeflated Poisson models.
By the analysis of the HTM data, we have ignored the issue of sequencing errors. Figure 6 suggests that, augmented with a model for the sequencing errors, the model could be applied to raw SAGE data as well.
Conclusion
SAGE data appear, when sequencing errors are handled properly, to follow a Poisson mixture with a lognormal prior on the Poisson parameter. The gamma prior (leading to a negative binomial distribution for the SAGE counts) provides a good approximation for low counts (up to between 10 and 20, depending mainly on library size). Using a bivariate gammaPoisson model, the transcriptome size can be estimated from the data; alternatively, the list of expressible tags from the Human Transcriptome Map can be used. Whether one prefers the mathematically convenient gamma prior or a lognormal prior traditionally applied to microarray data, and whether one prefers a parametric or nonparametric model for the highexpressed tags, we believe that the Poisson model is useful for analyzing SAGE data because it separates the sample variance in the Poisson process from the biological variance. Vencio [13] and Cai [12] both demonstrated it to be superior to alternative methods for gene expression differentiation testing and gene clustering, respectively. Assuming a prior distribution of the λ's across the genes could boost their methods further.
Methods
Data
We used 72 libraries from the Human Transcriptome Map (HTM) (Caron et al. [17]), which is a compilation of SAGE libraries from different human tissues. It contains expressions of 19825 genes, of which 8100 are represented by a single tag. In this paper, we focus on those genes mainly (we will motivate this choice in the discussion section). The remaining genes were mapped to two or more tags and the reported counts were the sums of the counts of those tags. Unlike raw SAGE libraries, the Human Transcriptome Map includes genes with zero counts, because each library contains counts for all genes that were expressed in at least one library. Also, tag counts considered likely to be false positives (such as sequencing errors) had been removed. See the HTM website [18] for details.
We also used two of the shorttag libraries from SAGEgenie [19] (library no. 430 and 1003). Those are raw data, i.e. tags likely to be false positives had not been removed.
Truncated gammaPoisson, ML and method of moments
We assumed that for each expression level i ≥ k the gammapoissonmodel predicts only a ratio w_{ i }≤ 1 of the observed number of tags with that expression level. Therefore, tags with expression level i are assigned a weight of w_{ i }when estimating α and β. Initially, the weights were set to 1 and k was set to the maximum observed expression level. Now, for each iteration step, k was decremented by 1, while α and β were estimated using the method of moments:
$\widehat{\alpha}=\frac{{\overline{y}}^{2}}{VAR(y){\overline{y}}^{2}}\text{}\left(10\right)$
$\widehat{\beta}=\frac{\overline{y}}{VAR(y){\overline{y}}^{2}}\text{}\left(11\right)$
For each i ≥ k the expected number of tags y_{i,exp}was computed. Now, the weights w_{ i }= y_{i,exp}/y_{i,obs}were computed and used for the estimation of α and β. This was iterated until Akaike's Information Criteria reached its optimum.
Truncated gammaPoisson, Maximum Likelihood and MCMC
The loglikelihood in the truncated negative binomial model, without the nonparametric component for the highexpressed tags, is
$\sum _{t}\left(\alpha log(\beta )+{\displaystyle \sum _{j=0}^{{Y}_{t}1}log(\alpha +j)(\alpha +{Y}_{t})log(\beta +1)log}\left(1{\left(\frac{\beta}{1+\beta}\right)}^{\alpha}\right)\right)}\text{}\left(12\right)$
We augmented this model with a nonparametric component for the highexpressed genes in the same way as the untruncated model described above. We maximized the loglikelihood using a quasiNewtonRaphson method (SPLUS function nlmin), and made use of the SPLUS option of computing the gradient and Hessian using the double dogleg step(Venables [20]). As a start guess for the iteration, we used the method of moments (see above). Note that although the sum over all tags is in principle a sum over several thousands indices, in practice it is only a sum over all observed levels of tag expression (counts), due to the discrete nature of the data.
For the threshold k for the nonparametric component, the value found by the method of moments (described above) was used (alternatively k can be handled as another parameter, but it turned out to be difficult to identify). For the prior distribution of the parameters, we used α ~ exp(1) and β ~ gamma(0.1, 1). In order to quantify the uncertainty on the parameter estimates, we also computed the Bayesian a posteriori distribution of the parameters in the truncated model with nonparametric component using a Markov Chain Monte Carlo (MCMC) algorithm. The MCMC simulations were carried out with WinBugs [21].
Bivariate Truncated gammaPoisson, Maximum Likelihood
As a model for the true gene expressions (λ) in two libraries (1 and 2), we assumed the trivariate reduction model (Mathal and Moschopoulos [22]):
(λ_{1}, λ_{2}) = (μ_{1} + τ, μ_{2} + τ), μ_{1}, μ_{2} ~ Gamma(α, β), τ ~ Gamma(ρ, β) (13)
In this model the observed counts Y_{t 1},Y_{t 2}with intensities λ_{t 1}and λ_{t 2}are assumed to arise as sums of a shared component Z with intensity τ and nonshared components X_{1} and X_{2} with intensities μ_{1} and μ_{2}.
Assuming that the rate is inversely proportional to the library sizes n_{1}, n_{2}, we get
Y_{1} = X_{1} +Z_{1}, Y_{2} = X_{2} + Z_{2} (14)
X_{1} ~ negbinom(α, β/n_{1}), X_{2} ~ negbinom(α, β/n_{2}) (15)
${Z}_{1}+{Z}_{2}~negbinom(\rho ,\frac{\beta}{{n}_{1}+{n}_{2}})\text{}\left(16\right)$
$({Z}_{1},{Z}_{2})({Z}_{1}+{Z}_{2})~binom({Z}_{1}+{Z}_{2},\frac{{n}_{1}}{{n}_{1}+{n}_{2}})\text{}\left(17\right)$
$P\left(({Y}_{1},{Y}_{2})=({y}_{1},{y}_{2})\right)={\displaystyle \sum _{i=0}^{{y}_{1}}{\displaystyle \sum _{j=0}^{{y}_{2}}P\left(\left({Z}_{1},{Z}_{2}\right)=(i,j)\right)P({X}_{1}={y}_{1}i)P({X}_{2}={y}_{2}i)}}\text{}\left(18\right)$
The correlation between lambda_{1} and lambda_{2} is
$corr(lambd{a}_{1},lambd{a}_{2})=\frac{VAR(\tau )}{\sqrt{VAR({\lambda}_{1})VAR({\lambda}_{2})}}\text{}\left(19\right)$
VAR(λ.) = VAR(μ.) + VAR(τ) (20)
where the variances of μ_{1}, μ_{2} and τ are derived from the gamma distribution. For each pair of libraries from the 19 largest libraries from the Human Transcriptome Map, we estimated the parameters in this model. Anticipating that the model would not fit to the frequencies of counts above the threshold found in the univariate model, which varied between 9 and 22, we restricted the analysis to tags with counts below 15 in both libraries. We also fitted an augmented model in which X_{1} and X_{2} were allowed to have different values of α. In that model, the marginal distributions of Y_{1} and Y_{2} are negative binomial with parameters (α_{1} + ρ, β/n_{1}) and (α_{2} + ρ, β/n_{2}), respectively. Note that it is not expected that α_{1} is characteristic for library 1 when library 1 is modeled together with different libraries: if library 1 and 2 show a high degree of correlation, ρ will be larger at the expense of α_{1} and α_{2}
Accounting for sequencing errors in raw SAGE data
Raw SAGE data contain a high number of tags with a count of one, many of which, presumably, are artifacts such as sequencing errors. Beissbarth [5] estimated the frequency of such artificial counts to be between 5% and 15% of the total number of tags, corresponding to an error rate of 0.5% to 1.5% per nucleotide. When analyzing the HTM data we assumed that the data had been thoroughly cleaned. But in order to apply our model to raw SAGE data, one needs to account for sequencing errors. We chose to incorporate sequencing errors in the bivariate model by assigning tags with counts one in one library and zero in the other library to a nonparametric component.
Abbreviations
 HTM :

Human Transcriptome Map
 SAGE :

Serial Analysis of Gene Expression
 MCMC :

Markov Chain Monte Carlo
References
 1.
Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression. Science 1995, 270: 484–487.
 2.
Kuznetsov VA: Distribution associated with stachastic processes of gene expression in a single eukaryotic cell. EURASIP Journal of Applied Signal Processing 2001, 4: 285–296. 10.1155/S1110865701000294
 3.
Stern MD, Anisimov SV, Boheler KR: Can transcriptome size be estimated from sage catalogs? Bioinformatics 2003, 19: 443–448. 10.1093/bioinformatics/btg018
 4.
Blades NJ, Jones JB, Kern SE, Parmigiani G: Denoising ofdata from serial analysis of gene expressions. Bioinformatics, in press.
 5.
Beissbarth T, Hyde L, Smyth GK, Job C, Boon WM, Tan SS, Scott HS, Speed TP: Statistical modelling of sequencing errors in sage libraries. Bioinformatics 2004, (Suppl 1):31–39. 10.1093/bioinformatics/bth924
 6.
Anisimov SV, Sharov AA: Incidence of "quasiditags" in catalogs generated by serial analysis of gene expression (sage). BMC Bioinformatics 2004, 5: 152. 10.1186/147121055152
 7.
Kuznetsov VA, Knott GD, Bonner RF: General statistics of stochastic process of gene expression in eukaryotic cells. Genetics 2002, 161: 1321–1332.
 8.
Wang SM, (Ed): SAGE: Current technologies and applications. Horizon Biosci; 2005.
 9.
Carlin BP, Louis TA: Bayes and Emperical Bayes methods for data analysis. Chapma and Hall, London; 1996.
 10.
Johnson NL, Kotz S, Kemp AW: Truncated negative binomial distributions. In Univariate discrete distributions. 2nd edition. Chapman and Hall, New York; 1992:225–227.
 11.
Schenzle D: Fitting the truncated negative binomial distribution without the second sample moment. Biometrics 1979, 35: 637–639.
 12.
Cai L, Huang H, Blackshaw S, Liu JS, Cepko C, Wong WH: Clustering analysis of sage data using a poisson approach. Genome Biology 2004, 5: 51. 10.1186/gb200457r51
 13.
Vencio RZN, Brentani A, Patrao AFC, Pereia CAB: Bayesian model accounting for withinclass biological variability in serial analysis of gene expressions (sage). BMC Bioinformatics 2004, 5: 119. 10.1186/147121055119
 14.
Ruijter JM, van Kampen AHC, Baas F: Statistical evaluation of sage libraries: consequences for experimental design. Physiol Genomics 2002, 11: 37–44.
 15.
Baggerly KA, Deng L, Morris JS, Marcelo Aldaz C: Differential expression in sage: accounting for normal betweenlibrary variation. Bioinformatics 2003, 19: 1477–1483. 10.1093/bioinformatics/btg173
 16.
Morris JS, Baggerly KA, Coombes KR: Bayesian shrinkage estimation of the relative abundance of mrna transcripts using sage. Biometrics 2003, 59: 476–486. 10.1111/15410420.00057
 17.
Caron H, van Schaik B, van der Mee M, Baas F, Riggins G, vanSluis P, Hermus MC, van Asperen R, Boon K, Voute KA, Heisterkamp A, van Kampen A, Versteeg R: The human transcriptome map:clustering of highly expressed genes in chromosomal domains. Science 2001, 291: 1289–1292. 10.1126/science.1056794
 18.
The human transcriptome map[http://bioinfo.amc.uva.nl/HTMseq/controller]
 19.
Liang P: Sage genie: a suite with panoramic view of gene expression. PNAS 2002, 99: 11547–11548. 10.1073/pnas.192436299
 20.
Venables WN, Ripley BD: General facilities for minimization. In Modern applied statistics with SPlus. 3rd edition. Springer, New York; 1999:267–269.
 21.
The bugs project[http://www.mrcbsu.cam.ac.uk/bugs/welcome.shtml]
 22.
Mathal AM, Moschopoulos PG: A form of multivariate gamma distribution. Annals of the Institute of Statistical Mathematics 1992, 44: 97–106. 10.1007/BF00048672
Acknowledgements
We thank our colleagues Barbera van Schaik, Jan Ruijter and Jan Koster for advice on the interpretation of the data descriptions.
Author information
Additional information
Authors' contributions
Both authors were involved in the development of the models. HT implemented the models and performed the computations. Both authors contributed to the manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Cite this article
Thygesen, H.H., Zwinderman, A.H. Modeling Sage data with a truncated gammaPoisson model. BMC Bioinformatics 7, 157 (2006). https://doi.org/10.1186/147121057157
Received:
Accepted:
Published:
Keywords
 Markov Chain Monte Carlo
 Sequencing Error
 Negative Binomial Model
 Bivariate Model
 Library Size