Bias correction and Bayesian analysis of aggregate counts in SAGE libraries
- Russell L Zaretzki^{1}Email author,
- Michael A Gilchrist^{2}Email author,
- William M Briggs^{3} and
- Artin Armagan^{4}
DOI: 10.1186/1471-2105-11-72
© Zaretzki et al; licensee BioMed Central Ltd. 2010
Received: 9 July 2009
Accepted: 3 February 2010
Published: 3 February 2010
Abstract
Background
Tag-based techniques, such as SAGE, are commonly used to sample the mRNA pool of an organism's transcriptome. Incomplete digestion during the tag formation process may allow for multiple tags to be generated from a given mRNA transcript. The probability of forming a tag varies with its relative location. As a result, the observed tag counts represent a biased sample of the actual transcript pool. In SAGE this bias can be avoided by ignoring all but the 3' most tag but will discard a large fraction of the observed data. Taking this bias into account should allow more of the available data to be used leading to increased statistical power.
Results
Three new hierarchical models, which directly embed a model for the variation in tag formation probability, are proposed and their associated Bayesian inference algorithms are developed. These models may be applied to libraries at both the tag and aggregate level. Simulation experiments and analysis of real data are used to contrast the accuracy of the various methods. The consequences of tag formation bias are discussed in the context of testing differential expression. A description is given as to how these algorithms can be applied in that context.
Conclusions
Several Bayesian inference algorithms that account for tag formation effects are compared with the DPB algorithm providing clear evidence of superior performance. The accuracy of inferences when using a particular non-informative prior is found to depend on the expression level of a given gene. The multivariate nature of the approach easily allows both univariate and joint tests of differential expression. Calculations demonstrate the potential for false positive and negative findings due to variation in tag formation probabilities across samples when testing for differential expression.
Background
Tag-based transcriptome sequencing libraries consist of a collection of short sequences of DNA called tags along with tabulated counts of the number of times each tag is observed in a sample. These observed tag counts represent a sample from a much larger pool of mRNA tags in a tissue or organism. In the past, Serial Analysis of Gene Expression (SAGE) was the most commonly used technology for generating libraries of tag counts. SAGE libraries have been used to address a number of biological questions including: estimating transcriptome size, and estimating the density of relative expression levels. Most frequently, SAGE was used to assess differential expression across cells from different tissues or strains, or cells grown under different experimental conditions. Next generation methods such as Digital Gene Expression (DGE) tag profiling [1] now provide a more efficient method to generate tag libraries and are growing in popularity. Libraries based on DGE are used to address the same questions and provide much larger numbers of tags leading to increased statistical power. However, the close similarities between DGE and SAGE, particularly the use of restriction enzymes, lead both techniques to share the same inherent biases.
As has been repeatedly noted in both SAGE and DGE studies e.g. [1–4], the mRNA from a single gene can lead to the production of more than one type of tag within the same library. Observing multiple tags from a single gene complicates data analysis and has a number of potential causes including alternative mRNA splicing and incomplete cDNA digestion. Incomplete digestion typically results in a number of different tags being formed from the same mRNA transcript. Alternative splicing can lead to multiple mRNA isoforms within a sample. When combined with incomplete digestion, each of these isoforms may give rise to a different tag or group of tags.
While it is intuitively sensible to analyze the prevalence of an mRNA by aggregating multiple tags derived from it, such tags are often analyzed separately [5]. At best, this approach diminishes the statistical power that is available when testing differences in mRNA abundances and reduces the accuracy of inferences based on asymptotic statistical methods (chi-square tests). More seriously, such comparisons can result in dramatic biases leading to erroneous conclusions in testing differential expression due to varying levels of incomplete digestion across libraries.
The current work extends an earlier analysis [6] based on SAGE libraries generated using the yeast Saccharomyces cerevisiae [7]. Alternative splicing is generally rare in microorganisms such as S. cerevisiae so that the tag multiplicity observed in these libraries is primarily attributed to incomplete cDNA digestion. The first objective of the current work is to provide a methodology that allows multiple tags, which arise due to incomplete digestion, to be combined and used to infer expression levels of the underlying mRNA transcripts. Aggregation is complicated by the fact that certain tags are more likely to form than others due to processing methods [4, 6]. Unless the tag formation probabilities are accounted for, inferences based on aggregate tag counts may be highly biased.
Previously, we have addressed the problem of estimation bias by directly modeling the variation in tag formation and deriving bias corrected maximum likelihood and posterior mode estimators of the composition of the mRNA population [6]. However, a number of assumptions and difficulties limit the application of these techniques. We address these in the present study. Three new hierarchical models for SAGE data are introduced that directly correct for the bias by embedding a model of tag formation within the probability distribution describing the data. Corresponding Gibbs sampling algorithms suitable for model fitting and inference are also derived. These methods are computationally efficient and allow a more general class of prior distributions than earlier analytic approaches. The results are also applicable for both aggregate and standard tag counts. We use simulation studies and analysis of publicly available data from S. cerevisiae [7] in order to contrast the accuracy of our sampling methods and analyze the effects of priors. The approach we present should serve as a building block for a more comprehensive method that takes into account the effects of both incomplete digestion and alternative splicing.
A second major objective of the current work is to explore the general effects of tag formation bias on differential expression. We carefully demonstrate that, in the absence of proper adjustments, variation in tag formation probability across experiments can lead to both false positive and false negative conclusions regarding the differential expression of a gene. A brief description of the implementation of proposed algorithms in differential expression analysis is also provided. While the approaches presented here require more work in processing and preparation than studies that focus on individual tags, they may offer a significant advantage with respect to statistical power in studies of differential expression.
A number of statistical methodologies have been developed to analyze SAGE libraries. For studies that focus on relative expression levels, individual tags are viewed as possible outcomes from a multinomial sampling model [7, 8]. [9] offered an improvement on this by directly applying a Bayesian multinomial-mixture Dirichlet model to the observed vector of tag counts which improves accuracy in estimation of tag frequency. Marginal analysis may be more useful in studies of transcriptome size and was developed by [10] and [11, 12]. Statistical methods to assess differential expression across a number of samples were discussed by [13–16]. These analyses typically test the differential expression of each tag across libraries independently; see [14].
The SAGE Methodology
The goal of a SAGE experiment [8] is to sample the mRNA population within a group of cells. Broadly speaking, the SAGE method generates a set of short sequence-based cDNA tags from the mRNA population. Initially, a pool of mRNA is extracted from a group of cells. The unstable single stranded mRNA is reverse transcribed into a double stranded DNA copy (cDNA) using a modified primer that allows the cDNA to be bound to a streptavidin bead. Using restriction enzymes, the bead-bound portion of cDNA is cut into small 'tags' that are concatenated together into longer cDNA molecules referred to as concatemers. These concatemers are amplified using PCR and then sequenced. A cleavage motif of the anchoring enzyme allows the start and stop points of tags to be identified within a concatemer. Each time a tag from a specific gene is observed in the sequence data, it contributes a count to the library. The DGE technology follows a similar procedure in the first several steps. However, tags are neither concatenated nor cloned but sequenced immediately.
In both technologies the restriction enzymes used to create the tags cut at very specific sites within the cDNA. For example, the restriction enzyme used by [8] cleaves only at the four nucleotide motif CATG. Thus, tags can only be created at specific points within a cDNA. Because the site where the cDNA is cut serves as an 'anchor' between tags, the restriction enzyme used is often referred to as the anchoring enzyme (AE). We refer to the specific points in the cDNA cleaved by the anchoring enzyme as AE sites.
A further complication is that many of these sites lead to non-unique or 'ambiguous' tags that cannot be readily assigned to a particular gene. Ambiguous tags are 'uninformative' using current technologies; tag-to-gene mapping is discussed at length in [17]. The tags generated in the experiment analyzed below were 14 bp long leading to a number of ambiguous tags. Experimental advances, for example, 'SuperSAGE' techniques can lead to tags up to 26 bp long [18, 19] significantly diminishing this problem.
Modeling Tag Formation Probabilities
The previous discussion leads us to postulate that genes with larger numbers of AE sites have a increased probability of tag formation and further that there is large variation in the probability of tag formation within a gene. Below we present a quantitative model for tag formation probabilities within a gene that is key to later bias correction methods.
Because only cDNA that is attached to a streptavidin bead is retained during the experimental process, tags are created from the 3' most AE site that is actually cleaved (Figure 1). Let k_{ i }be the number of restriction enzyme sites which may be cleaved by the AE on mRNA generated from gene i, let p be the probability the AE will cleave a site, and assume that cleavage probability is independent between sites and does not vary between genes. If we label sites 1 through k_{ i }starting at the 3' most site (i.e. the site closest to the bead; see Figure 1), the probability of generating a tag through AE cleavage at site j ∈ {1, ..., k_{ i }} follows a geometric distribution, (1 - p)^{j-1}p. This corresponds to the probability of no cleaving in sites 1 to j - 1, and a cleaving at site j. Note that this probability is independent of what happens at the AE sites 5' from the j th site. The fact that the expected distribution of tags varies with AE cleavage probability p can be used to estimate p from the library of tag counts; see [6].
where the sum is over all non-ambiguous tagging sites. We refer to this quantity as the tag formation probability.
For purposes of our analysis, we will assume that, because each of the thousands of genes provides an independent estimate, the value of p is known to a very high precision. The variation in the tag formation probability between genes stems from variation in the number of AE sites as well as the number of ambiguous tagging sites. Given p, the only uncertainty in the estimate of ϕ_{ i }results from ambiguous tags; this problem may be partially eliminated through techniques that generate longer tags; see [19]. Hence, for methodological purposes we reasonably assume the ϕ_{ i }to be known constants (see [6] for a more details on these calculations). As a result, the tag pool represents a biased sample of the mRNA population of interest being a function of both the known distribution of tag formation probabilities across the genome, and the distribution of these genes within the mRNA population itself.
Traditionally, a gene's frequency in the tag pool was assumed to be an unbiased estimate of its frequency in the mRNA population, the population of interest. This equivalence only holds when all genes have the same tag formation probability, a condition that is never met. Consequently, genes with greater than average tag formation probabilities will be over-represented while those with lower than the average probability will be under-represented. Equation (1) allows us to accurately estimate this sampling bias and correct the inferences made from tag libraries.
While computation of the maximum likelihood estimates (MLE's) is relatively straightforward, constructing confidence intervals for m is difficult. The existence of the normalization term ∑_{ j }m_{ j }ϕ_{ j }in the denominator of the left hand side of Eq. (4) makes computation of Fisher's information matrix taxing, particularly for the large dimensional vectors encountered when working with SAGE data (i.e. l is of the order 10^{3} to 10^{4}). In addition, because the number of categories is generally within an order of magnitude of, or possibly even greater than, the number of tags sampled, most categories have either zero or only a few observations. These relatively small counts call into question inferences such as chi-squared tests which are based on asymptotic approximations.
[6] introduces methods for the direct maximization of this quantity and discusses the choice of a prior and its consequences on the marginal inferences for individual values of m_{ i }. However, a number of difficulties severely limit the range of prior parameters that can be used; i.e. those priors with appreciable weight relative to the observed sample sizes. Given the bias in the observed data, the Gibbs sampling based approaches for inference based on the posterior of the mRNA proportions introduced here are more flexible, and require fewer assumptions than analytic techniques. The methods are also highly computationally efficient. While the assumption of independent cleaving resulted in a geometric model, it is important to note that alternative models of tag formation can be substituted into the Bayesian algorithms without making any further adjustments to the model.
Shrinkage Estimation in Multivariate Models
Joint estimators of the relative proportions of individual tags in SAGE experiments based on posterior means from a conjugate Dirichlet-Multinomial model were explored by [9] and have similar properties to the methods proposed here. In the case of flat priors (i.e. α_{ i }= 1 for all i), they note that while the posterior mean provide improved average accuracy with respect to sums of squared errors across all categories, estimates of categories where large counts are observed were shrunk excessively in order to boost the estimated probabilities of cells with few or no counts. Due to the massive number of categories and the extreme imbalance in observed frequencies (e.g. more than half of the categories have zero counts in our data), the estimates for frequently observed genes tend to consistently underestimate the true proportions while the reverse is true of genes with very low expression proportions. This observation highlights the main weakness of shrinkage estimators such as the posterior mean. While they perform better on average across categories, they may perform poorly on particular categories that may be of primary interest. [9] addresses this issue by proposing a mixture of Dirichlet-Multinomial's but does not discuss the issue of sampling bias. Note that after aggregation, the large reduction in the number of categories significantly reduces the side effects of shrinkage leading to improved inferences. These observations will be useful in analyzing the bias corrected models proposed below.
Results and Discussion
Hierarchical Models and Gibbs Sampling
Three hierarchical approaches are proposed to model aggregate tag counts while taking into account the tag formation probability. The Dirichlet-Poisson-Binomial model (DPB) extends the method of [10], who modeled tag counts as independent random variables from a Poisson-Gamma mixture to a multivariate setting that models the joint distribution of tags. The Dirichlet-Multinomial-Binomial model (DMB) views the unobserved true counts of mRNA in the population as being derived from a multinomial distribution. The missing-data model (MD) assumes that if the number of mRNA transcripts that were not converted into tags due to incomplete digestion by the anchoring enzyme, denoted r, were known then the data could be modeled as a standard Dirichlet-Multinomial model. By proposing a Poisson distribution for r, posterior inferences can be made. Detailed descriptions for these algorithms are given in the Methods section. Gibbs samplers based on these algorithms all produce posterior samples of m, the vector of relative proportions of each gene in the mRNA population. In addition, DPB produces samples representing N, the hypothetical total number of mRNA transcripts in the population while MD produces posterior samples of r, the number of unconverted transcripts. These quantities may be relevant in studies of the number of unique transcripts.
A Dirchlet prior distribution was assumed for the proportion vector m in all three algorithms. Two parameter vectors = (α_{1}, ..., α_{ l }) were tested. The flat prior sets α_{ i }= 1 for all i while the tub prior sets α_{ i }= 1/l. The flat prior effectively assumes that a hypothetical previous experiment produced 1 tag for each category. As the name suggests, the flat prior represents a uniform density over the parameter space. The tub prior assumes that the prior information is equivalent to that of a single tag. The density of this prior takes a bowl or tub shape. The preponderance of mass along the edges forces the estimates of m_{ i }to be small unless significant counts are observed for that gene.
Both Figures 2 and 3 show that modal estimates typically lie slightly above the 97.5% bound of the simulated marginal posterior distribution. The reader might wonder why the multivariate mode and mean are so far apart. The explanation lies in the fact that posterior distribution is spread across a a large parameter space consisting of many thousands of dimensions. While the posterior distribution has a skewed density with a very slight slope, it is essentially flat over most of the parameter space. While the mode is located at the most likely point, an extreme point where only a fraction of the categories have non-zero tags, the mass spread across the rest of the space pulls the mean away from this point leading to the observed separation. This is similar to the effect of skewness in a one dimensional distribution which pulls the mean away from the mode.
Graphical displays based on the results of the DMB and MD algorithms had very similar appearances and so are not shown. However, meaningful differences in posterior means generated by the methods and across the two priors do exist and are discussed in the following section.
Autocorrelation in Gibbs samples of proportions.
DPB | DMB | MD | ||||
---|---|---|---|---|---|---|
Lag | ( α = 1) | ( α = 1/ l ) | ( α = 1) | ( α = 1/ l ) | ( α = 1) | ( α = 1/ l ) |
10 | 0.011 | 0.000 | 0.015 | 0.019 | 0.053 | 0.008 |
20 | 0.047 | 0.000 | -0.025 | 0.013 | -0.012 | 0.010 |
40 | -0.001 | -0.028 | -0.019 | 0.013 | -0.034 | 0.012 |
80 | 0.000 | 0.027 | 0.000 | -0.007 | 0.011 | -0.010 |
Autocorrelation in Gibbs samples of population size.
DPB | MD | |||
---|---|---|---|---|
N | N | r | r | |
Lag | ( α = 1) | ( α = 1/ l ) | ( α = 1) | ( α = 1/ l ) |
10 | 0.905 | 0.024 | 0.606 | 0.007 |
20 | 0.863 | 0.018 | 0.344 | -0.012 |
40 | 0.783 | -0.029 | 0.088 | -0.034 |
80 | 0.666 | 0.027 | -0.011 | 0.011 |
In terms of computational efficiency, with l = 6, 178, 5 million samples from the DPB algorithm required 21.7 and 19.3 minutes with the flat and tub priors respectively. The DMB was slightly slower taking 30.7 and 25.2 minutes to run. Finally, the MD method completed 5 million samples in 20.2 and 19.3 minutes, respectively.
In order to avoid any effects due to autocorrelation our real data experiments stored only the last of every hundred samples for analysis. Inference for means was based upon the final 1000 of these stored samples drawn after an extensive burn-in period of 400,000 simulations. Because the vector m is usually the quantity of interest, far fewer samples would be needed in practice. This would significantly reduce the computation times for the algorithm.
A Simulation Study
Although all methods produce similar trends in their estimates, quantitative differences do exist. It is also important to quantify the effects of the prior distributions on the analysis. Because the inferential quality is often more relevant in scientific investigations of relative and differential expression than point estimates, we attempt to contrast inferential quality of the three methods by examining marginal 95% posterior intervals constructed for m_{ i }from simulated libraries. The proportion of times the computed interval covers the true known simulation proportion m_{ i }is then recorded. Accuracy is assessed by how close the observed coverage percentage is to the desired level of 95%. The average interval length of the 95% posterior intervals was considered as a second criteria of inferential accuracy but no meaningful differences in average interval length were observed among the three methods.
Simulated coverage probabilities for proposed methods.
Proportion m_{ i }Range | Library Size | Gene Count | Prior | |||||
---|---|---|---|---|---|---|---|---|
Flat α _{ i } = 1 | Tub α _{ i } = 1/ k | |||||||
DPB | DMB | MD | DPB | DMB | MD | |||
0 - 1.67 × 10^{-5} | 6178 | 1181 | 0.800 | 0.855 | 0.854 | 0.104 | 0.104 | 0.104 |
- | 1000 | 25 | 0.825 | 0.834 | 0.834 | 0.107 | 0.108 | 0.107 |
1.67 × 10^{-5} - 1.23 × 10^{-4} | 6178 | 3678 | 0.947 | 0.975 | 0.976 | 0.520 | 0.520 | 0.520 |
- | 1000 | 209 | 0.9425 | 0.950 | 0.950 | 0.558 | 0.558 | 0.558 |
1.23 × 10^{-4} - 9.12 × 10^{-4} | 6178 | 1173 | 0.961 | 0.951 | 0.948 | 0.899 | 0.898 | 0.898 |
- | 1000 | 578 | 0.952 | 0.957 | 0.957 | 0.911 | 0.911 | 0.911 |
9.12 × 10^{-4} - 6.74 × 10^{-3} | 6178 | 133 | 0.939 | 0.485 | 0.479 | 0.945 | 0.944 | 0.944 |
- | 1000 | 165 | 0.950 | 0.950 | 0.945 | 0.934 | 0.934 | 0.934 |
6.74 × 10^{-3} - 1.35 × 10^{-1} | 6178 | 13 | 0.809 | 0.009 | 0.005 | 0.953 | 0.951 | 0.951 |
- | 1000 | 23 | 0.948 | 0.794 | 0.780 | 0.943 | 0.939 | 0.939 |
Because these methods are very efficient to compute, both methods can be applied to analyze a single library with the tub prior being relied upon for larger count categories while the flat prior is used for the remaining categories.
Differential Expression in SAGE between Libraries
Tag based transcriptome sequencing such as SAGE is most commonly used to identify differential expression of genes across groups of libraries. It is important to point out the consequences of differences in tag formation probabilities in such studies along with the advantages of the proposed methods in this context.
where is the tag formation probability for tag j in experiment A, Ω is the true odds ratio, and k_{ j }is the number of AE positions upstream from which the tag was derived. The approximation holds when both p_{ a }and p_{ b }are large (i.e. > 0.5) and the i^{th} expression level m_{ i }is small in an absolute sense for both experiments, conditions that are almost universally met in SAGE studies.
If only data derived from the 3' most tagging site of each gene is analyzed so that k = 0, and ϕ_{ ja }= p_{ a }then the odds simply reduces to (m_{ ia }(1 - m_{ ib }))/(m_{ ib }(1 - m_{ ia })), the exact odds ratio desired. However, suppose all tags are allowed and the experimenter tests differential expression for the i^{th} tag, which derives from an AE site 1 position upstream from the 3' end. The tag formation probability now takes the form p(1 - p). Assuming cutting probabilities p_{ a }= .92 and p_{ b }= .96 we see that our estimate of the odds ratio is overestimated by a factor of ~1.9 potentially leading to a false positive diagnosis. In fact, this phenomenon could result in a sizeable proportion of both false positive and false negative conclusions. If SAGE analysis were restricted to 3' most AE sites and all data from upstream tags were eliminated from the library, ϕ_{ i }would be constant and sampling bias would be eliminated by analysis using odds ratios (logistic regression coefficients are estimated log-odds ratios). Once again, the caveat to this approach is that a significant proportion of the observed tags may be discarded. The need for the adjustments we propose stems from an effort to accurately combine multiple tags arising from the same mRNA and offers the ability to utilize a much larger fraction of the data collected leading to more power for differential testing.
Monte-Carlo sampling mechanisms such as DPB can easily be applied to evaluate differential expression across groups of libraries. In the case of comparison for gene i across two libraries, individual Monte Carlo samples from each library can be differenced, m_{i 1}- m_{i 2}, and the distribution of these differences used for inference; see [[9], Section 6.]. As in a paired t-test, if the bulk of differences fall far from 0 we may conclude that differential expression exists. Alternatively, it may be more appropriate to form odds ratios based on the sampled quantities and see if the bulk of these are far from 1.
The Monte-Carlo sampling nature of the algorithms presented here and their computational efficiency leads to several advantages over recent methods [14–16]. First, and foremost the ability to use a larger fraction of the data collected while simultaneously correcting bias is critical to improving power. In addition, MCMC samples allow both marginal and joint testing for differential expression across all genes using a single sampling run for each library involved. Joint testing of differential expression is possible because posterior samples represent the joint posterior distribution of the proportions m. Importantly, our method requires no asymptotic conditions to be met in order to guarantee the validity of the results.
Conclusions
This work focuses on a general biasing mechanism that may have a significant impact on the interpretation of libraries generated by SAGE and closely related methods. Beyond drawing attention to the consequences of incomplete digestion, two main contributions are presented. First, we derive and analyze three Bayesian algorithms that provide corrected posterior inference for relative proportions of genes or tags in the overall population. Second, through calculations we deduce the consequences of tag formation bias on testing of differential expression and discuss how the algorithms derived here can be used to correctly assess differential expression.
This work compliments the earlier work in [6] by providing a flexible and efficient methodology which extends the range of the prior distributions that are available. It also allows researchers access to interval estimates in order to assess uncertainty in the estimated quantities.
The proposed methods are appropriate when incomplete digestion is the main source of multiple tag generation. They may be applied to both tag level data and to inference at the gene level based on aggregated counts. If possible, aggregating tags will provide better inferences due to larger available counts. Of the three approaches, the DPB method provided the most accurate confidence intervals over the widest range of relative proportions and we recommend its exclusive use. In addition, the choice of prior also plays an important role in the effectiveness of the algorithm. Of the two priors tested, the flat prior was effective over a wide range of categories but resulted in excessive shrinkage for the most abundant categories. For a fixed number of sampled tags, this shrinkage effect becomes more severe as one increases the number of categories. Hence, aggregating tags provides a second advantage since it greatly reduces the number of categories used in an analysis. The proposed tub prior provides little shrinkage and posterior intervals based on this prior are very accurate for abundant categories. In practice, both priors should be applied and inference should be based on the samples with the larger average.
We note that the sampling algorithms presented are independent of the model for tag formation probability and models other than the geometric model discussed can easily be substituted. It should also be possible to extend the proposed methods using a mixture methodology as suggested by [9].
Correction of tag formation bias may play a significant role in improving power in differential expression studies based on SAGE or related tag libraries. To ensure valid testing, one must either include bias correction or eliminate non-3' tag counts from the analysis. Furthermore, the ability to effectively test for changes in groups of genes (gene networks) is now possible due to the multivariate nature of the posterior distribution. This advantage may become significantly more important as more elaborate studies and techniques are developed in the future.
Due to the overall simplicity of genome architecture, alternative splicing is generally not a problem when analyzing gene expression in S. cerevisiae and other microorganisms. However, alternative splicing is widespread in most multicellular organisms [1, 2] and greatly complicates transcriptome analysis. While the adjustments we propose do not solve the interpretation problems caused by alternative splicing, properly identifying multiple tags from common mRNA is a necessary first step in doing so. A logical next step would be to expand the tag formation model underlying this work by allowing a single gene to produce multiple mRNA isoforms instead of a single type. One approach would be to extend our model in a manner similar to the approach developed for analyzing mRNA sequence data developed by [20].
Extending our model in such a way should allow us to make inferences about the distribution of mRNA isoforms produced by a gene based on the distribution of tags experimentally observed.
Conceptually, the problems caused by alternative splicing are similar to those caused by tag ambiguity between different genes. For example, in the case of alternative splicing, the challenge is in determining the mRNA isoform from which each tag originates. Similarly, in the case of ambiguous tags, the challenge is in determining the gene from which each tag originates. Thus it is plausible that data interpretation problems caused by both alternative splicing and ambiguous tags could be overcome by extending our model to include multiple sources of a single tag.
Methods
The Data
We explicitly consider three strategies to simulate from the posterior distribution given by Eq. (5). The algorithms discussed are tested for inferential accuracy using simulated data as well as being applied to a published SAGE data set that analyzes a yeast organism S. cerevisiae [7]. The published data was collected from the log-phase. There are 6,178 genes included in the data. The maximum tag count was 392 and the minimum was zero. 3,560 of the genes included had observed counts of 0. Gene dependent sampling probabilities range from ≈ .3% to 100%. Tag assignments to unique genes and assignment of non-unique tags for this analysis were described in [6].
Computing and Software
All simulations described below were computed using R-environment (Version 2.6.2) on an 8 core, Intel Xeon 2.66 GHz desktop server running Linux Ubuntu 8.04. R-scripts that implement the Gibbs sampling algorithms described below are provided; see Additional file 1.
The Dirichlet-Poisson-Binomial (DPB) Model
The DPB approach, inspired by Casella and George [21], assumes a fixed population of mRNA's of size N exists within the sampled cells. The total size N is random across samples and follows a gamma distribution that is rounded down to the nearest integer. The choice of gamma here is convenient due to its role as a conjugate prior to the Poisson distribution. Given N, the relative proportions of the categories of mRNA, m_{ i }, are unknown and assumed to follow a Dirichlet distribution with prior = (α_{1}, ..., α_{ l }). The α_{ i }will typically be identical resulting in objective inference.
Because cells contain mRNA transcripts from thousands of different genes, the probability of seeing any particular type is low. It is therefore logical to assume, given N and m_{ i }, that the actual number of mRNA's of a certain type g_{ i }, i = 1, ... l, extracted from the group of cells, satisfies a Poisson distribution with mean μ = N·m_{ i }. Finally, the restriction enzyme process generates a tag count t_{ i }for a particular mRNA transcript in a binomial fashion based upon the tag formation probability ϕ_{ i }. Hence, the hierarchical data generating mechanism follows, T_{ i }~Binomial(g_{ i }, ϕ_{ i }), g_{ i }~Poisson(N * m_{ i }), m ~D_{ l }( ), and N ~ceiling(Gamma(γ_{1}, γ_{2})). We refer to this model as the Dirichlet-Poisson-Binomial (DPB) model.
where δ = ( , γ_{1}, γ_{2}). Gibbs sampling can be implemented by iteratively simulating from each conditional after replacing any unknown random quantities in the conditioning set with simulation values from the previous iteration; see [22].
The hierarchical model considered here is not identical to Eq. (3), but depends upon prior parameters (γ_{1}, γ_{2}) which effect the mode of the posterior distribution. One possibility for choosing values of these parameters is to minimize the distance between the approximate analytical mode discussed in [6] and the mode of this model which can be computed through iterative maximization of the conditionals; see [23, 24] for further details.
In our simulations we used a shape parameter γ_{1} = 100 and prior scale γ_{2} = 200. The mean of a gamma with these parameters is 20,000 which is somewhat larger than a natural estimate of the population size, ∑_{ i }(t_{ i }/ϕ_{ i }) = 16438.81. This was chosen intentionally to determine if the posterior would converge to a reasonable estimate of N.
The Dirichlet-Multinomial-Binomial (DMB) Model
where m *(1 - ) is an element-wise product with 1 representing a vector of identical dimension to m. Hence, the i^{th} element of the vector is given by m_{ i }(1 - ϕ_{ i }).
This formulation is more natural in the sense that it restricts the pre-tagging counts to sum to the population total N. One drawback is that the observed data provides no information for posterior inference about the population size N. Iterative maximization for mode calculation is not viable due to the discrete nature of the multinomial conditional distribution.
The Missing Data (MD) Model
In order to apply the Gibbs sampler a value for the prior mean of r, μ must be selected. A logical mean for the Poisson is t_{ tot }∑_{ i }m_{ i }(1 - ϕ_{ i })/∑_{ i }m_{ i }ϕ_{ i }since ∑m_{ i }(1 - ϕ_{ i }) = 1 - ∑m_{ i }ϕ_{ i }is the probability that an mRNA is not converted into a tag. Equation (4) provides a useful estimate of ∑m_{ i }ϕ_{ i }in the case of a flat prior.
Like the DPB algorithm above, the missing data algorithm also admits an iterative optimization procedure in order to compute the posterior mode. The mode of the Dirichlet conditional given by [22] is = (t_{ i }+ α_{ i }- 1)/(∑_{ i }(α_{ i }+ t_{ i }) + r - l) while the mode for the Poisson is the mean rounded down to the nearest smaller integer. For example, considering a flat prior for m along with a prior mean of μ = 10, 100 for r, the posterior mode of the missing data model is nearly identical to the exact analytical mode given in [6]. It is tempting to invoke the argument used to derive Eq. (7) to obtain the marginal mode.
Unfortunately, when α_{ i }< 1 and t_{ i }= 0 this leads to estimates outside of the parameter space.
Simulation Study Protocol
Simulation of libraries was based upon the proportions estimated from the aggregated S. cerevisiae log-phase data described earlier. First, long and short multinomial proportion vectors m of lengths l = 6, 178 and 1, 000 respectively were constructed. Actual mean estimates from the log-phase library were adopted for the case l = 6, 178 while for l = 1, 000, m was based on renormalizing the first 1,000 elements of the longer case. The vector of known tag formation probabilities, {ϕ_{1}, ..., ϕ_{ l }} was constructed by simulating a theoretical number of AE cleaving sites and then using Eq. (1) to compute individual probabilities ϕ_{ i }. The number of cleaving sites followed a Poisson distribution with mean 2 with 1 count added to each category. This ensured that each "gene" has at least one cleaving site. The cleavage probability was set at p = .55 based on estimates given in [6].
Using the above protocol, for each combination of library size, l, and prior, 1,000 libraries, t_{ j }, each containing n = 15, 000 tags were simulated. For each simulated library, each of the three proposed algorithms was used to generate 20,500 posterior sample vectors m_{ jk }, k ∈ 1, ...20, 500. Every 20th sample was collected after a burn in period of 500 samples resulting 1,000 Monte Carlo samples of m for each method. Samples were then used to generate mean, median and 95% marginal posterior intervals for each of the l gene categories. Coverage percentages for the i^{th} gene category are computed by finding the proportion of times that the interval for category i from a particular method contains the true proportion m_{ i }across the 1,000 simulated libraries t_{ j }. Figure 4 displays a sample of coverage percentages across a range of categories for the DPB method under two different prior values.
Declarations
Acknowledgements
Authors would like to thank Robert Mee, Naomi Altman, Reviewers and Editors for helpful comments which significantly improved the quality of the manuscript.
Authors’ Affiliations
References
- 't Hoen PAC, Ariyurek Y, Thygesen HH, Vreugdenhil E, Vossen RHAM, de Menezes RX, Boer JM, van Ommen GJB, den Dunnen JT: Deep sequencing-based expression analysis shows major advances in robustness, resolution and inter-lab portability over five microarray platforms. Nucleic Acids Research 2008, 36(21):e141. 10.1093/nar/gkn705View ArticlePubMedPubMed CentralGoogle Scholar
- Pauws E, van Kampen A, Graaf S, de Vijlder J, Ris-Stalpers C: Heterogeneity in polyadenylation cleavage sites in mammalian mRNA sequences: implications for SAGE analysis. Nucleic Acids Research 2001, 29(8):1690–1694. 10.1093/nar/29.8.1690View ArticlePubMedPubMed CentralGoogle Scholar
- Boon K, Osorio E, Greenhut S, Schaefer C, Shoemaker J, Polyak K, Morin P, Buetow K, adn S de Souza RS, Riggins G: An anatomy of normal and malignant gene expression. Proceedings of the National Academy of Science 2002, 99(17):11287–11292. [SAGE Genie] [SAGE Genie] 10.1073/pnas.152324199View ArticleGoogle Scholar
- Malig R, Varela C, Agosin E, Melo F: Accurate and unambiguous tag-to-gene mapping in serial analysis of gene expression. BMC Bioinformatics 2006., 7(487):Google Scholar
- Romualdi C, Bortoluzzi S, d'Alessi F, Danieli G: IDEG6: a web tool for detection of differentially expressed genes in multiple tag sampling experiments. Physiological Genomics 2003, 12: 159–162.View ArticlePubMedGoogle Scholar
- Gilchrist MA, Qin H, Zaretzki RL: Modeling SAGE tag formation and its effects on data interpretation within a Bayesian framework. BMC Bioinfomatics 2007., 8(403):Google Scholar
- Velculescu VE, Zhang L, Zhou W, Vogelstein J, Basrai MA, Bassett JDE, Hieter P, Vogelstein B, Kinzler KW: Characterization of the yeast transcriptome. Cell 1997, 88: 243–251. 10.1016/S0092-8674(00)81845-0View ArticlePubMedGoogle Scholar
- Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial Analysis of Gene Expression. Science 1995, 270(5235):484–487. 10.1126/science.270.5235.484View ArticlePubMedGoogle Scholar
- Morris JS, Baggerly KA, Coombes KR: Bayesian shrinkage estimation of the relative abundance of mRNA transcripts using SAGE. Biometrics 2003, 59: 476–486. 10.1111/1541-0420.00057View ArticlePubMedGoogle Scholar
- Thygesen HH, Zwinderman AH: Modeling Sage data with a truncated gamma-Poisson model. BMC Bioinformatics 2006, 7: 157. 10.1186/1471-2105-7-157View ArticlePubMedPubMed CentralGoogle Scholar
- Kuznetsov VA, Knott GD, Bonner RF: General Statistics of Stochastic Process of Gene Expression in Eukaryotic Cells. Genetics 2002, 161(3):1321–1332.PubMedPubMed CentralGoogle Scholar
- Kuznetsov VA: Statistical Methods in Serial Analysis of Gene Expression(SAGE). In Computational and Statistical Approaches to Genomics. 2nd edition. Edited by: Zhang W, Shmulevich I. New York: Springer Verlag; 2006:163–2008. full_textView ArticleGoogle Scholar
- Baggerly KA, Deng L, Morris JS, Aldaz CM: Differential expression in SAGE: accounting for normal between-library variation. Bioinformatics 2003, 19: 1477–1483. 10.1093/bioinformatics/btg173View ArticlePubMedGoogle Scholar
- Baggerly KA, Deng L, Morris JS, Aldaz C: Overdispersed logistic regression for SAGE: Modelling multiple groups and covariates. BMC Bioinformatics 2004, 5: 144. 10.1186/1471-2105-5-144View ArticlePubMedPubMed CentralGoogle Scholar
- Lu J, Tomfohr JK, Kepler TB: Identifying differential expression in multiple SAGE libraries: an overdispersed log-linear model appraoch. BMC Bioinformatics 2005., 6(165):Google Scholar
- Vencio RZN, Brentani H, Patrao DFC, Pereira CAB: Bayesian model accounting for within-class biological variability in Serial Analysis of Gene Expression (SAGE). BMC Bioinformatics 2004., 5(119):Google Scholar
- Vencio RZN, Brentani H: Statistical Methods in Serial Analysis of Gene Expression(SAGE). In Computational and Statistical Approaches to Genomics. 2nd edition. Edited by: Zhang W, Shmulevich I. New York: Springer Verlag; 2006:209–233. full_textView ArticleGoogle Scholar
- Matsumura H, Reich S, Ito A, Saitoh H, Kamoun S, Winter P, Kahl G, Reuter M, Kruger DH, Terauchi R: Gene expression analysis of plant host-pathogen interactions by SuperSAGE. Proceedings of the National Academy of Sciences 2003, 100(26):15718–15723. [http://www.pnas.org/cgi/content/abstract/100/26/15718] 10.1073/pnas.2536670100View ArticleGoogle Scholar
- Matsumura H, Bin Nasir KH, Yoshida K, Ito A, Kahl G, Kruger DH, Terauchi R: SuperSAGE array: the direct use of 26-base-pair transcript tags in oligonucleotide arrays. Nature Methods 2006, 3: 469–474. 10.1038/nmeth882View ArticlePubMedGoogle Scholar
- Jiang H, Wong WH: Statistical inferences for isoform expression in RNA-Seq. Bioinformatics 2009, 25(8):1026–1032. 10.1093/bioinformatics/btp113View ArticlePubMedPubMed CentralGoogle Scholar
- Arnold SF: Gibbs Sampling. In Handbook of Statistics. Volume 9. Edited by: Rao C. New York, NY: Elsevier/North-Holland; 1993:599–625. 10.1016/S0169-7161(05)80142-7Google Scholar
- Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis. 2nd edition. Texts in Statistical Science, Boca Raton, FL: Chapman & Hall/CRC; 2004.Google Scholar
- Lindley DV, Smith AFM: Bayes Estimates for the Linear Model. Journal of the Royal Statistical Society. Series B (Methodological) 1972., 34:Google Scholar
- Chen MH, Shao QM, Ibrahim JG: Monte Carlo Methods in Bayesian Computation. New York, NY: Springer Verlag; 2000.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.