Discarding duplicate ditags in LongSAGE analysis may introduce significant error

Background During gene expression analysis by Serial Analysis of Gene Expression (SAGE), duplicate ditags are routinely removed from the data analysis, because they are suspected to stem from artifacts during SAGE library construction. As a consequence, naturally occurring duplicate ditags are also removed from the analysis leading to an error of measurement. Results An algorithm was developed to analyze the differential occurrence of SAGE tags in different ditag combinations. Analysis of a pancreatic acinar cell LongSAGE library showed no sign of a general amplification bias that justified the removal of all duplicate ditags. Extending the analysis to 10 additional LongSAGE libraries showed no justification for removal of all duplicate ditags either. On the contrary, while the error introduced in original SAGE by removal of naturally occurring duplicate ditags is insignificant, it leads to an error of up to 3 fold in LongSAGE. However, the algorithm developed for the analysis of duplicate ditags was able to identify individual artifact ditags that originated from rare nucleotide variations of tags and vector contamination. Conclusion The removal of all duplicate ditags was unfounded for the datasets analyzed and led to large errors. This may also be the case for other LongSAGE datasets already present in databases. Analysis of the ditag population, however, can identify artifact tags that should be removed from analysis or have their tag count adjusted.


Background
Serial Analysis of Gene expression (SAGE) is a global and digital gene expression profiling method [1,2]. It relies on three fundamental principles: (i) a short nucleotide tag cut from a cDNA copy of an mRNA is sufficient to uniquely identify the transcript, (ii) two tags can be ligated together to form ditags and unambiguously amplified by PCR, and (iii) multiple tags can be concatenated for efficient detection by DNA sequencing. The overall reliability of SAGE has been compared to other gene expression profiling methods such as Northern Blots [3], real-time or kinetic PCR [4,5], and cDNA and oligo nucleotide micro array hybridizations [6][7][8]. It was generally found that the reliability and reproducibility of SAGE is high. Typically 70-85% of gene expression changes observed in SAGE can be confirmed by a different method [4,7]. However, a potential bias introduced by amplification of ditags was discussed already in the original SAGE publication [1]. It was suspected that duplicate ditags, i.e. identical copies of a ditag (AB), would occur only as an artifact of PCR amplification. Therefore, duplicate ditags have been removed prior to tag counting in most SAGE studies so far, partly because of requests from reviewers before publication.
However, duplicate ditags will be encountered naturally with a certain frequency, depending on abundance of the two transcripts from which the ditag is derived [8,9]. For example, in the original SAGE protocol two blunt ended 14 nucleotide tags were ligated to form ditags. Two tags A and B, each occurring at a frequency of 0.02 have a 0.0004 probability of being joined. Present SAGE studies typically include 50,000 tags (25,000 ditags) leading to 10 AB+BA ditags. However, the total count of a tag of 0.02 frequency in 50,000 is 1000, and the error of 10 introduced by removing the naturally occurring ditags is insignificant. Furthermore, an algorithm to minimize this problem (SAGEparser) was developed by Snyder and coworkers [10].
However, recent developments in SAGE technology have accentuated the problem of discarding duplicate ditags. First, there has been a drive towards using smaller samples for construction of SAGE libraries, facilitating the analysis of cells with specialized functions such as pancreatic cells obtained from biopsies [4]. Such samples may have extreme gene expression profiles with single transcripts accounting for 5 % of the total population of transcripts. Second, the widespread use of the LongSAGE protocol in which a two base pair overhang is used in the ligation of ditags, instead of blunt ends [2]. Consequently, any Long-SAGE tag can only form ditags with tags with a compatible overhang, in principle reducing the number of potential partner tags 16 fold on the average. In this paper we analyze the error introduced by discarding naturally occurring duplicate ditags in LongSAGE and describe a probabilistic algorithm that can distinguish naturally occurring ditags from artifacts.

Results and discussion
A prediction of the number of duplicate ditags as a function of the abundance of the two monotags in SAGE and LongSAGE is shown in figure 1. It illustrates the error introduced by deleting duplicate ditags as commonly practiced in these analyses. In original SAGE, all bluntended tags can combine with all other tags and, therefore, the number of a particular duplicate ditag AB can be predicted from equation 1. In LongSAGE, tags have a 2 nt overhang and their ditag partner is constrained accordingly. Therefore, the number of a particular LongSAGE ditag can be estimated from equation 2, assuming an even distribution of possible overhangs.
As can be seen in figure 1, discarding duplicate ditags introduces a serious error for abundant tags in LongSAGE Estimating occurrence of duplicate ditags in SAGE based on an even distribution of compatible overlapping tags data analysis. For example, two abundant tags, each present 1500 times in a typical 50,000 tag study, would give rise to only 45 duplicate ditags in SAGE, but 720 duplicate ditags in LongSAGE. Counting duplicates only once, as is presently done, would result in an error of 3% for SAGE and 48% for LongSAGE. In this example we have removed the duplicates for AB only, but in reality a particular tag A, will create duplicates with any compatible tag B, C, D etc. at the frequency stipulated in equations 1 or 2. If each of these duplicates is counted only once, a further reduction of the tag count is introduced and, therefore, the error will increase. Indeed, this simple estimate demonstrates that the problem is expected to be much greater for LongSAGE, than for conventional SAGE.
However, the assumption of equal proportions of compatible overhangs in LongSAGE is unrealistic. The genome sequence is not a random distribution of nucleotides [11] and furthermore, this would limit the maximum tag count of any tag to 1/16 of the total tag count of the library (e.g. 3125 in 50,000), and individual tag counts larger than 3125 have been encountered.
The experimental dataset derived from RNA isolated from pancreatic acinar cells by the aRNA-LongSAGE procedure was therefore analyzed in greater detail [4]. It contains 44,276 tags before removal of duplicate ditags and 31,868 after. The unusual high numbers of duplicate ditags reflects an extraordinary abundance of transcripts encoding the enzymes of the digestive juice (table 1). Table 1 shows the 12 most abundant tags, the enzyme encoded and the tag counts before and after removal of duplicate ditags. Removing duplicate ditags from the dataset reduces the tag count with up to 67% for the most abun-dant tag. The most abundant tags are the ones most often affected, although medium abundance tags are also significantly affected, if they are compatible with predominant tags (see table 2, and additional file 1). In fact, only for total tag counts (without removal of duplicate ditags) less than 10 is the majority of tags unchanged by removal of duplicate ditags. And 19% of medium abundance tags ) are changed at least 1.5 fold. Bear in mind, that the error is not affecting all tags to a similar degree, while some tags are unchanged, others may be greatly affected. For an example, the tag count for CATGGGCGACTCT-GGCGGCCC is 40 tags before removal and only 14 after, a change of 3 fold [additional file 1]. Anisimov et al. have argued that up to 5% false ditags, so-called quasi-ditags should be removed from SAGE analyses [12]. In our case, removing quasi-ditags has only marginal effect of the analysis (data not shown), presumably because the error corrected by Anisimov et al. is exclusively affecting rare tags, whereas we are concerning with an error increasingly affecting tags the more abundant they get.
The corrective measures suggested by Welle [9]and Snyder [10] are both iterative approaches developed for SAGE. Welle suggests splitting up the dataset in a number of subdatasets, removing duplicate ditags, and then adding these datasets together allowing duplicates. The number of subdatasets to be created, determining the maximal number of duplicate ditags is a simple guess. In fact, this method is equivalent to setting a maximal allowed ditag count instead of excluding all. In LongSAGE, duplicate ditag counts of several hundreds are frequently observed. Therefore, determining a low meaningful fixed number of duplicate ditags is not feasible. Snyder's algoritm (SAGEparser) includes a proportion of the observed In this study, an algorithm implemented in Perl was developed (LongSAGE_bias.pl, see methods for details) which extracts both monotags and ditags from phred or fasta formatted sequence files, defines the two nt overhang of tag pairs in the ditags, and counts and sorts these ditags into compatible overlapping classes. Of the 44,276 tags in total, 34,464 were seen twice or more. Considering these tags only (thus excluding most tags originating from sequencing error) 12,408 (36%) were present in duplicate ditags. A major complication of the analysis is the presence of most abundant tags in several forms differing in length by one or rarely by two nucleotides. Thus a single tag may be split into two or more compatible overlapping classes. For this analysis, only tags between 40 and 42 nt were considered (including the NlaIII recognition site, CATG of both tags). These accounted for 98.6% of all ditags (table 3). A 40 nucleotide ditag contains two tags in the short form (21 nt each, as 2 nt are shared), a 41 nucleotide ditag one short form and one long form, and a 42 nucleotide ditag contains two tags in the long form (22 nt each, as 2 nt are shared). The relative propensity for a tag to appear in long or short form was calculated for each tag without considering the 41 nucleotide ditag, as for these ditags, we cannot know which of the two tags is present in the long form. Likewise, the compatible overlapping class was determined from the 40 and 42 nucleotide ditags only, because the overlap in these tags can be unambiguously identified as the two central nucleotides of the ditag.
The number of ditags in the 10 possible overlapping classes is tabulated in table 3. The length distribution shows a general overrepresentation of the long form. Apparently, the tag generating restriction enzyme, MmeI, cleaves the DNA strand 21/19 nucleotides downstream of its recognition site twice as often as 20/18 nucleotides. Furthermore, the true distribution among compatible overlapping classes is far from uniform. The overlap CG was only present in 17 ditags, whereas the AA (or TT) was present in 3173 ditags. This is in line with the relative dinucleotide abundance in humans, which also shows a severe under representation of CG, whereas AA (or TT) is more commonly observed [11]. This observation is corroborated with the finding that CpG islands are predominantly found in the first exon of genes and therefore rarely in the 3' end of transcripts mostly represented in SAGE [13]. Analyzing the distribution of overlaps generated insilico from the human RefSeq v. 16 also shows a similar tendency, albeit to a lesser extent (table 3).
The uneven distribution of compatible overhangs actually raises the question whether a particular overhang can be present in such a low abundance that it suppresses the measurement of other tags due to an insufficient number of compatible monotags. One solution to this would be to conduct the LongSAGE experiments by blunt-ending by T4-DNA polymerase prior to ligation of ditags, thus shortening the LongSAGE tags by two nucleotides [14]. An advantage of this approach is that the error introduced by removal of duplicate ditags is small and similar to those indicated for SAGE (figure 1). However, in the case presented here, saturation of compatible overlaps is not the reason for the rare occurrence of the CG overlap, as CG is a palindrome and thus may form ditags with itself.
Two predictions are calculated for the occurrence of each ditag (including 41 nucleotide ditags) using equation 3 (see methods and figure 2 for details). The two predictions are often the same, but may differ for ditags containing less abundant tags and will approach uniformity for higher duplicate ditag counts. A plot of the predicted numbers of ditags against the observed number of ditags is shown in figure 3a [see additional file 2 for details]. The slope of a linear regression analysis of the data is 1.3, in reasonable agreement with the expected 1. However, the Pearson product moment correlation coefficient between the observed and the predicted is a modest 0.61.
Detection of outliers is performed by calculating standardized residuals according to equation (4) for each ditag. Assuming normal distribution of the standardized residuals, the standard deviation is calculated for all ditags observed more than once. An observation is classified as an outlier if the standardized residual is larger than three standard deviations (99% confidence). The standardized residuals are plotted in figure 3b. The low number of standardized residuals that fall outside the confidence interval (outliers) indicates that most duplicate ditags seem to represent true tags.
Inspection of the outliers reveals that most of these ditags contain at least one tag ending in a nucleotide that is inconsistent with the nucleotide sequences of the corresponding Unigene found in the TIGR Human Gene Index (see table 4), and most likely represents nucleotide variations. The overlapping class of this variant tag is inconsistent with the overlapping class of the non-variant tag used for the prediction. Obviously, the algorithm would provide erroneous predictions in these cases, since it is based on the frequency of the non-variant monotag, which is much higher than the frequency of the variant monotag.
A different, rather abundant ditag was observed (86 times) much more often than predicted (8 times). BLAST analysis of this ditag reveals that it consists of two tags derived from the E. coli β-lactamase gene and thus is most likely the result of vector contamination. Removing these data points from the regression increases the Pearson correlation coefficient to 0.95. Therefore, an excellent correlation between the number of duplicate ditags observed and predicted is obtained not discarding duplicate ditags.
To investigate whether this is special to this particular dataset, we have performed the analysis on additional datasets, five derived from potato tuber (Høgh, Emmersen and Nielsen, unpublished) and five other libraries derived from pancreatic tissue [4]. The analysis can be carried out on any LongSAGE library, but becomes more precise with more ditags included in the analysis. In our experience, depending on the proportion of duplicate ditags present, a minimum library size of 35,000 tags seems to be the lower limit for a reliable analysis. In the future, exploiting new DNA sequencing technologies, libraries larger than 150,000 tags will probably be common [15]. Varying numbers of duplicate ditags were present in these libraries, and suspicious outlier ditags were identified in all libraries. However, in none of the libraries the duplicate ditags were biased to an extent that justified the bulk removal of all duplicates. Tag extractions including or excluding duplicate ditags of the most abundant tran- a Fold change is calculated by dividing the total tag count with the tag count obtained after removal of duplicate ditags. Percentage of total number of different tags in the indicated intervals is given in parentheses. Overview of the LongSAGE_bias.pl PERL script used for the data analyses   10 100 1000 10000 Standardized residuals D It has been found, that the cross-platform agreement of transcriptome measurements by SAGE, LongSAGE and DNA oligonucleotide microarrays was only modest, in contrast to a good intra-platform reproducibility [7]. A majority, but not all of the differentially expressed genes identified by either method can be verified by RT-PCR (e.g. [4] and [16]). Therefore, to maximize the accuracy of LongSAGE and hopefully improve the cross-platform consistency, it is important analyze and adjust the tag sequences (including all duplicate ditags) for a number of potential biases prior to mapping to database sequences. First, linker sequences should be removed. Second, the dataset should be tested with the present algorithm. Duplicate ditag counts that falls within the confidence intervals should not be adjusted. But in the case of outliers, these can be manually removed after careful analysis (as in this study), or automatically adjusted to the predicted count by the algorithm to minimize the impact of artifacts. Third, to minimize the number of false positives identified as differentially expressed genes, artifact tags originating from sequence errors should be resolved using SAGEScreen [17].

Conclusion
The analyses presented here clearly demonstrate that the present procedure of discarding all duplicate ditags can lead to large errors in LongSAGE studies. Instead, the algorithm described here should be used to test LongSAGE datasets and identify potentially biased ditags that should be adjusted or removed. Based on the results obtained it is likely, that most of the transcriptome profiles present in the databases have been artificially biased by the removal of duplicate ditags.

Equations
Assuming that the observed tag counts (after amplification and sequence extraction) are representative of the actual distribution of tag molecules, the expected occurrence of a duplicate ditag AB in SAGE can be approximated by where D is the number of ditags, P the probability, and T the number of monotags observed.    [19]). A schematic representation of the script is shown in figure 2. First, the script extracts both ditags and monotags (21 nt), including possible linker derived tags, from DNA sequence files and the corresponding quality values (*.phd) generated by the Phred base caller [20]. Second, the script then calculates the length of each ditag and uses the 40 nt and 42 nt ditags for the calculation of the distribution of compatible overlapping tags and the propensi-ties that each tag is 21 (short form) or 22 (long form) nucleotides. This information is then used to predict the occurrence of any ditag composed of tags observed in a duplicate ditag by equation 3 according to the tag length consistent with the ditag in question. For 40 or 42 nucleotide ditags, a prediction is made for each of the two monotags constituting the ditag.
In the case of 41 nucleotide ditags, the ditag AB is first analyzed. Since A can exist in a 41 nt ditag both in the long and a short form, two predictions are made and the one closest to the observed is chosen. Then, the ditag BA is considered in an identical manner. The standardized residuals are calculated and the results are written to tabulator separated files easily imported into any spreadsheet for further analysis. Assuming the ditag counts are Poisson distributed, the mean can be estimated as the observed count and the standard deviation as the square root of the observed. The confidence interval of ditag counts can thus be estimated as mean ± 2*standard deviation. For small ditag counts this confidence interval extends below zero. Consequently, the standard deviation of the standardized residuals is calculated from ditags observed four or more times only (4-2*√4 = 0).
The algorithm can be set to include all duplicate ditags, remove all duplicate ditags and adjust the observed ditag counts that fall outside the confidence interval to the prediction value.
In sum, libraries derived from pancreatic acinar cells, ductal cells, and four libraries from different grades of pancreatic intraepithelial neoplasia were analyzed from pancreas. In addition, 5 potato tuber libraries derived from 6 week old minitubers, at harvest, two libraries from 60 days post harvest dormant tubers, and from tuber tissue excised from under an emerging sprout.

Analyzing dinucleotide overlap distribution of tags generated in-silico
LongSAGE tags of 17 nt + CATG were extracted from the human RefSeq v. 16 fasta file and the dinucleotide overlap distributions determined using the PERL script dinuccount.pl [19].

Comparison of LongSAGE libraries with and without inclusion of duplicate ditags
LongSAGE tags from libraries generated from all potato and pancreatic tissue were extracted using the Perl script sage-phred.pl. For pancreatic acinar and ductal cells the tags were mapped to the Human Gene Index [21] using sagemap.pl. The two libraries were compared using the Perl script acprob.pl and statistically significant changes using strict Bonferroni correction was recorded including