A Bayesian nonparametric method for prediction in EST analysis

Background Expressed sequence tags (ESTs) analyses are a fundamental tool for gene identification in organisms. Given a preliminary EST sample from a certain library, several statistical prediction problems arise. In particular, it is of interest to estimate how many new genes can be detected in a future EST sample of given size and also to determine the gene discovery rate: these estimates represent the basis for deciding whether to proceed sequencing the library and, in case of a positive decision, a guideline for selecting the size of the new sample. Such information is also useful for establishing sequencing efficiency in experimental design and for measuring the degree of redundancy of an EST library. Results In this work we propose a Bayesian nonparametric approach for tackling statistical problems related to EST surveys. In particular, we provide estimates for: a) the coverage, defined as the proportion of unique genes in the library represented in the given sample of reads; b) the number of new unique genes to be observed in a future sample; c) the discovery rate of new genes as a function of the future sample size. The Bayesian nonparametric model we adopt conveys, in a statistically rigorous way, the available information into prediction. Our proposal has appealing properties over frequentist nonparametric methods, which become unstable when prediction is required for large future samples. EST libraries, previously studied with frequentist methods, are analyzed in detail. Conclusion The Bayesian nonparametric approach we undertake yields valuable tools for gene capture and prediction in EST libraries. The estimators we obtain do not feature the kind of drawbacks associated with frequentist estimators and are reliable for any size of the additional sample.


Background
Expressed Sequence Tags (ESTs) are generated by partially sequencing randomly isolated gene transcripts that have been converted into cDNA. From their introduction in Adams et al. [1], ESTs have played an important role in the identification, discovery and characterization of organisms as they provide an attractive and efficient alternative to full genome sequencing. The resulting transcript sequences and their corresponding abundances are the main focus of interest providing the identification and level of expression of genes. Important issues to be addressed in terms of design of a future study are: (1) a comparison among different cDNA libraries of the same organism with the aim of detecting the least redundant library; (2) the determination of an "optimal" number of genes to be sequenced by the experimenter. Indeed, these issues are relevant since, despite the novel advances in technology, see [2], sequencing is still expensive and therefore suitable cost-effectiveness thresholds must be established. This suggests that there is the need for assessing the relative redundancy of various libraries prepared from the same organism in order to detect which one yields new genes at a higher rate. Indeed, there are 'normalization' protocols which aim at making the frequencies of genes in the library more uniform thus typically improving the discovery rate. However, performing such protocols is also expensive. Hence, the decision, whether to proceed with sequencing of a non-normalized library or to resort to a normalization procedure, has to balance carefully the involved costs: such a decision is necessarily based on statistical estimates of the coverage of the given sample, of the expected number of new genes in a future sample and on the future discovery rate. Note that ideally one would like to sequence the smallest possible portion of the library and, based on the outcome, predict the tentative future sequencing well beyond the size of the given dataset.
The practical issues discussed above naturally translate in statistical problems which, given an initial sample of EST, can be described as follows: a) Coverage: Coverage can be seen as the proportion of genes in the library represented in the initial sample or, equivalently, the probability that a new read will not produce a new gene. The coverage estimate provides a first description of redundancy of the library. b) Expected number of new genes: Having observed an initial sample of size n generated from the cDNA library and estimated its coverage, prediction of outcomes of further reads is in order. The first question to answer is: 'How many new unique genes are expected to be detected in an additional EST dataset of targeted size m?' Such estimates provide, then, an overall measure of redundancy of the library with reference to a further EST survey. c) Discovery rate: In addition to the expected number of genes in a future sample of size m, it is also important to establish the rate at which the probability of discovering a new gene decays as more and more reads are recorded. In other words, interest lies in determining the probability that the (n + m + 1)-th read leads to a new gene, given the observed initial sample of size n and regardless of the experimental outcome yielded by the m intermediate draws. The availability of the discovery rate as a function of the size of the future sample m, represents then a pointwise predictive measure of the evolution of redundancy as the sequencing ideally proceeds.
Note that the combination of the measures under b) and c) provides a natural guideline for selecting the size of a future sample m. Supposing the targeted number of new genes is j, the estimator in b) guides the selection of the minimum sample size which should lead to j new genes. Then, one can resort to the discovery rate: in case it is relatively low around , it may be convenient to reduce the size of the future sample in a way that the discovery rate does not fall below a threshold suggested by the problem at issue. In such a case, one obviously would not be able to achieve the targeted number j of newly observed genes: costs considerations may indeed suggest that, with such a low discovery rate, sequencing is too expensive. On the other hand, if the discovery rate around is still relatively high, one may decide to enlarge the survey size. Moreover, the information conveyed by b) and c) is useful in comparing libraries and, again, it is worth considering these estimates together. Indeed, suppose we have to compare two libraries and that, for a fixed size m of the additional sample, library 1 yields a larger expected number of new genes but a lower discovery rate in comparison with library 2. If the sample size m is increased to m + m', for m' sufficiently large, the comparison between the two libraries can lead to different conclusions in the sense that a larger number of new genes is predicted for library 2. This happens because library 1 features a lower discovery rate, which implies that, within the additional m' draws, the expected number of new genes is lower for library 1. With reference to 'normalization' protocols, this means that the decision whether to carry it out or not should also depend on the foreseen sample size. For instance, the normalized Mastigamoeba balamuthi data we analyze exhibit a higher discovery rate, with respect to the non-normalized one, for small m. But, since the discovery rate has a faster decay, it appears that, already for moderately large m, the effect of the 'normalization' is exhausted producing fewer number of new genes. The three questions raised above can be seen as particular instances of classical species sampling problems: indeed, in the present context each species takes on the meaning of gene and the population is given by the library. Species problems appear in a variety of different applied situations such as astronomy, ecology, linguistics, machine learning, population biology. We now briefly recall well-known estimation methods which have recently been applied to EST data and then outline the key ideas of our Bayesian nonparametric approach. m m m

Estimation methods
The main frequentist tools, that are useful for inference on the cDNA library properties described in the previous section, are based on the theory set forth in Good [3] and Good and Toulmin [4], where nonparametric estimators for the sample coverage and the expected number of new species to be detected in a future sample of size m, given the initial sample, are provided. The estimator of the sample coverage in [3] coincides with the proportion of distinct species represented by at least two units in the sample. Good attributes the original idea to Turing and this explains why it is usually referred to as Turing estimator. The popular Good-Toulmin estimator for the number of new species to be observed in a future sample is derived in [4] and, as a by-product, an evaluation of the discovery probability is achieved. Recently, the interest in species sampling problems has remarkably grown, mainly due to their importance in genomics. Indeed, Mao [5] studies various properties of the Good-Toulmin estimator and shows that it can be also viewed as a nonparametric empirical Bayes estimator. In [6], the authors suggest a parametric variation of the Good-Toulmin estimator. An alternative to it is presented in [7], where the detection of ESTs from each gene in EST sequencing is modeled by means of a Poisson process whose intensity is governed by some unknown distribution. It is to be noted that all frequentist nonparametric approaches lead to reliable estimates for the number of new genes in an additional sample only if its size is not too large. For instance, if the size of the additional survey m is larger than the initial sample n, it is well-known that the Good-Toulmin predictor can become a monotone decreasing function of m: this leads to the paradox of predicting fewer new genes by enlarging the additional sample size m. Even the nonparametric alternative proposed in [7] yields reliable results only when m ≤ 2n. This fact is also outlined in [8]. Hence, if one wishes to predict the number of new genes for large m, one needs to resort to a parametric framework. As we will see, the relative dimension of m with respect to n is not an issue in a Bayesian nonparametric framework, and the expected number of new genes that will be discovered in m further reads is monotone increasing with respect to m.
The application of Bayesian methods in this area of research is, to the authors' knowledge, quite modest even if the Bayesian learning scheme is very well suited for making predictions with EST data. An early contribution, based on a model for sampling from a finite population, is provided by Hill [9] where posterior estimates of the coverage are obtained. However, computational problems do not allow, in this approach, a direct and effective evaluation of the expected number of new species in a future sample. Recently, Lijoi et al. [10] have proposed new Bayesian nonparametric estimators for the problems a)-c) mentioned above. The prior distribution they employ is induced by a family of exchangeable Gibbs random partitions. See Pitman [11] for an interesting review of recent advances and applications of the theory of Gibbs random partitions. Their application to a Bayesian inferential framework is very useful since they provide a general scheme which encompasses some of the most notable nonparametric priors such as the Dirichlet and the two parameter Poisson-Dirichlet process. In this paper, we apply the general formulas derived in [10] with the two parameter Poisson-Dirichlet process as prior distribution. It will be seen that the resulting expressions can be evaluated exactly and do not need for any supplementary simulation scheme. Moreover, such a Bayesian approach does not incur in any problem for large values of m since all possible behaviors of future EST data are incorporated in the probabilistic model.

EST Datasets
The datasets we analyze consist of ESTs samples obtained from cDNA libraries from two different organisms: the amitochondriate protist Mastigamoeba balamuthi (nonnormalized and normalized libraries, where the normalized library was prepared from the non-normalized library) and Naegleria gruberi libraries, prepared from cells grown under different culture conditions, aerobic and anaerobic. These data sets have been previously analyzed in [6], where a full account of their preparation is detailed. It is worth mentioning that our approach assumes fulllength cDNA clones and high quality sequence reads. Therefore, possible errors associated with the clustering procedure are not considered. For the statistical identification and evaluation of types of clustering errors one may incur in EST sequencing, the reader is referred to Wang et al. [12].
Specifically, each EST survey consists of n reads with k unique genes and corresponding frequencies n 1 ,...,n k , i.e.  Table 1 summarizes the four EST samples using the compact notation set in (1). For example, the survey of the naeglaria aerobic library produces n = 959 reads with k = 473 unique genes, which are clustered into 17 levels of expression 1, 2,..., 12,16,17,18,27,55. For the first level we have r 1 = 346, meaning that 346 genes appear just once, that is n 1 = n 2 = … = n 346 = 1. For the second level r 2 = 57 implies that 57 genes appear twice and, hence, n 347 = n 348 = … = n 403 = 2 and so on up to r 55 = 1, which means that 1 gene is represented 55 times yielding n 473 = 55.

Coverage, estimation of the number of new genes and discovery rate
We applied the Bayesian nonparametric method, detailed in the following section, to these datsets and obtained the following results. Denote the unknown proportion of genes (in the whole library) belonging to the i-th class by p i . Then, the coverage of the initial sample of size n is given by which is precisely the proportion of unique genes represented in the initial sample. Our estimates for the coverage are 0.47 and 0.45 for the non-normalized (n = 715) and normalized (n = 363) Mastigamoeba, respectively. This means that, by virtue of the 'normalization', an initial sample of about half the size produces almost the same coverage. Moreover, we get 0.64 and 0.49 for the aerobic (n = 959) and anaerobic (n = 969) Naegleria, respectively: clearly, the tissue cultured aerobically achieves a remarkably higher coverage with an initial sample of the same size, which may lead to conclude that the sequencing of the aerobic tissue is more effective. However, such a finding could also be the consequence of a higher redundancy in the aerobic library. Finally, it is worth noting that our results for the coverage match exactly the ones obtained in [6], where the frequentist estimator described in [3] was exploited.
Turning attention to predicting the outcomes of future sequencing for the libraries at issue, we focus on the expected number of new genes in an additional sample of size m and on the discovery rate. The first index provides an overall measure of redundancy with respect to the additional sample of size m, whereas the discovery rate predicts the trend at which the discovery probability decays as more and more reads come in. If one adopts a Bayesian nonparametric approach, these quantities can be estimated rigorously and exactly since such an approach is naturally designed for prediction. In contrast note that, as already anticipated, the Good-Toulmin estimator becomes highly variable and unstable if the size of the additional sample m is larger than the size of the initial sample n. In particular, the Good-Toulmin estimator often produces negative values as estimates for the number of new genes if m ∈ (n, 2n) and almost always   Source: Susko and Roger [6] behaves badly for m > 2n. Such a phenomenon can be seen in Figure 1 for the two Naegleria libraries. In order to overcome these problems, frequentist methods typically give up the flexibility of the nonparametric approach and resort to parametric models, whose fit can be a delicate issue. For instance, Susko and Roger [6] resort to an approximated version of the Good-Toulmin estimator which assumes a parametric model for the expression levels r l .
In order to give a complete picture, it is important to accompany our point estimates with the 95% highest posterior density intervals, which represent the Bayesian counterpart to frequentist confidence intervals (see [13]).
In Tables 2 and 3 the results arising by the application of the Bayesian nonparametric method are displayed.
As for the Mastigamoeba libraries, an interesting phenomenon takes place: the survey of the normalized library has achieved almost the same coverage (0.45) as the non-normalized library (0.47), but when considering an additional sample it exhibits a significantly faster decay in the discovery rate. Figure 2 compares the discovery rate for the two libraries. It is worth pointing out that our estimates predict that the discovery rates associated to both libraries coincide for m = 125 yielding a discovery probability of 0.508. For larger m the non-normalized exhibits a higher discovery rate. This implies that at some point also the estimates for the expected number of new genes in the additional sample will coincide: indeed, this is estimated to happen for m = 270, for which 137 new genes are predicted to be identified from both libraries. Hence, for m > 270 the expected number of new genes is systematically higher for the non-normalized library. For instance, if m = 1089, just 477 new genes are expected for the normalized library and 510 for the non-normalized. Taking m larger, at some point even the highest posterior density intervals will not overlap anymore. Such a behavior hints toward the fact that, in deciding whether to perform a 'normalization' protocol, the sizes of the samples to be drawn from the libraries is a variable to be taken into account.
As for the Naegleria libraries the behavior is apparent in the sense that the anaerobic library systematically produces more new genes and the discovery probability is sensibly higher at the considered levels of m. Note that the aerobic library presents a slightly slower decay rate but an extremely large m is required for matching the expected number of genes of the anaerobic one. Figure 3 displays the estimated decay rate of the discovery probability for both libraries with the corresponding 95% highest posterior density intervals.

A Bayesian nonparametric methodology
The primary aim of the Bayesian approach to inference is prediction and Bayesian methods are tailored for conveying the available information into prediction. In particular, for EST sequencing, the main problem of frequentist methods is represented by the difficulty of incorporating not yet observed unique genes into the model. This can then produce unpleasant behaviors of estimators such as the one exhibited by the Good-Toulmin estimator discussed before. In contrast, the Bayesian nonparametric approach naturally incorporates the fact that further sequencing will feature new unique genes and leads to consistent predictions.
In our framework we are going to consider a sample of n EST data yielding K n distinct gene species with corresponding frequencies N = (N 1 ,..., ). Clearly formula [15] which can be recovered by letting σ tend to zero and it represents a important formula in modern probability theory. See [11]. Recently, it has found many interesting applications for bacterial taxonomy [16], clustering of microarray gene expression data [17], mixture models [18], linguistics [19], among others.
In a Bayesian nonparametric setting, one alternatively obtains model (3) by selecting the two parameter Poisson-Dirichlet process as a prior for the genes proportions within the library. This clearly makes the sequence of tags exchangeable, thus implying that the order of appearance of the tags does not influence probability Non-normalized and normalized Mastigamoeba libraries: the first column provides the size of the additional sample in % of the size of the initial sample, the second the actual size of the additional survey, the third presents the expected number of new genes and the fourth the discovery probability. The estimates in the third and fourth column are accompanied by the 95% highest posterior density intervals. Naeglaria aerobic and anaerobic libraries: the first column provides the size of the additional sample in % of the size of the initial sample, the second the actual size of the additional survey, the third presents the expected number of new genes and the fourth the discovery probability. The estimates in the third and fourth column are accompanied by the 95% highest posterior density intervals.
assessments. Such an assumption, which constitutes the Bayesian analog of the frequentist assumption of independent and identically distributed data, is clearly reasonable in the context of EST sequences. Note that we implicitly assume that the sequence of tags can be extended to infinity. However, the size of the library represents an upper bound for the number of unique genes that will be observed and it is always finite: hence, all the estimates we are going to obtain will be finite.
As mentioned before, the Bayesian nonpara metric approach has the advantage of yielding, in a straight-forward way, predictive distributions for future observations given the data. Considering Pitman's sampling formula, the probability of detecting a new gene from a future observation, given a sample of n tags containing k distinct genes, is whereas the probability of re-observing the j-th unique gene coincides with (n j -σ)/(θ + n) j = 1,...,k.
As already pointed out, in the analysis of ESTs one is also interested in evaluating: (i) the expected number of new genes that will be recorded in a further sample of size m and (ii) the discovery probability, which is the probability of observing a new gene in the (n + m + 1)-th draw, given the initial sample of size n. The basis for deriving estimators for these quantities is represented by the distribution of the number of new genes to be observed in an additional sample given the initial sample. Such a posterior probability, which can be seen as the predictive distribution for the outcome of additional m reads, is given by See [10] for details on its derivation. From (7) Bayes estimators (under quadratic loss function) for both the expected number of new genes and the discovery probability have been obtained, within general Gibbs random partition models, in [10]. The expected number of new genes observed in a future sample of size m coincides with and the discovery probability turns out to be equal to Moreover, the highest posterior density intervals can be derived in quite a straightforward way from (7). The only Size of the additional sample Expected number of new genes Aerobic Anaerobic point left to discuss concerns the specification of the parameters (σ, θ). In order to avoid subjective inputs in the model, (σ, θ) is fixed according to an empirical Bayes rule which consists in choosing σ and θ that maximize (3) corresponding to the observed sample (k, n 1 ,...,n k ), i.e. Figure 4 provides the contour plots corresponding to the two Naegleria gruberi datasets: the parameters maximizing Contour plots of Pitman's sampling formula. Contour plots of Pitman's sampling formula, as a function of (σ, θ), corresponding to the two Naegleria gruberi datasets: aerobic (right) and anaerobic (left).
increases one expects that a larger number of new genes is going to be observed in a further m-sample. This is also confirmed by the behavior of as σ varies. Figure 3 suggests an almost linear increase of , as a function of m, and accordance with linearity is higher the closer σ is to 1. In contrast, when σ is low and close to 0 the function is concave and increases at a lower rate as σ increases.
In order to further evaluate the goodness of the estimation method, we implemented a cross-validation procedure. Given a dataset of size n, this consisted in randomly drawing without replacement a sub-sample of size n/2. On the basis of this sub-sample, we have fixed, according to (10), the values of (σ, θ).

Conclusion
In this paper we have presented a Bayesian nonparametric approach, which relies on Pitman's sampling formula, for prediction problems arising in sequencing of EST librar-ies. This provides a fully probabilistic model which conveys, in a statistically rigorous way, the available information into prediction. No parametric assumption is made and the prior is fixed using an empirical Bayes approach. The resulting estimators are applied to four EST libraries and lead to interesting and coherent predictions of the outcome of additional sequencing. The arising information is of great value for researchers providing guidelines in: establishing the quality of a certain library; deciding whether to perform a normalization protocol; choosing whether to proceed with sequencing from a certain library; determining the size of an additional EST survey etc.
It is important to remark that our Bayesian nonparametric approach does not feature problems usually exhibited by frequentist methods. In particular, no ad-hoc adjustments or introduction of parametric components is necessary for predicting future reads if their number is larger than the initial survey. Finally, the estimators presented here can be easily adapted to take into account joint data from multiple libraries leading to Bayesian analogs of the estimators set forth in [6].

Methods
Here we briefly describe how the estimators in (8) and (9) are derived by simplifying the expressions provided in [10]. In particular, one finds out that , where the 's are displayed in (7) and can be deduced from (8) in [10]. The further simplification yielding the expression of in (8) is obtained by observing that (θ + 1) n-1 /(θ + 1) n + m-1 = 1/(θ + n) m and As far as the determination of (9), note that where (1) is the probability of observing a new gene at the (n + m + 1)-th draw given the in the previous sample, of size n + m, there have been detected k + j distinct genes. Hence, by virtue of the prediction structure associated with the two parameter Poisson-Dirichlet