Discarding duplicate ditags in LongSAGE analysis may introduce significant error
© Emmersen et al. 2007
Received: 25 September 2006
Accepted: 14 March 2007
Published: 14 March 2007
Skip to main content
© Emmersen et al. 2007
Received: 25 September 2006
Accepted: 14 March 2007
Published: 14 March 2007
During gene expression analysis by Serial Analysis of Gene Expression (SAGE), duplicate ditags are routinely removed from the data analysis, because they are suspected to stem from artifacts during SAGE library construction. As a consequence, naturally occurring duplicate ditags are also removed from the analysis leading to an error of measurement.
An algorithm was developed to analyze the differential occurrence of SAGE tags in different ditag combinations. Analysis of a pancreatic acinar cell LongSAGE library showed no sign of a general amplification bias that justified the removal of all duplicate ditags. Extending the analysis to 10 additional LongSAGE libraries showed no justification for removal of all duplicate ditags either. On the contrary, while the error introduced in original SAGE by removal of naturally occurring duplicate ditags is insignificant, it leads to an error of up to 3 fold in LongSAGE. However, the algorithm developed for the analysis of duplicate ditags was able to identify individual artifact ditags that originated from rare nucleotide variations of tags and vector contamination.
The removal of all duplicate ditags was unfounded for the datasets analyzed and led to large errors. This may also be the case for other LongSAGE datasets already present in databases. Analysis of the ditag population, however, can identify artifact tags that should be removed from analysis or have their tag count adjusted.
Serial Analysis of Gene expression (SAGE) is a global and digital gene expression profiling method [1, 2]. It relies on three fundamental principles: (i) a short nucleotide tag cut from a cDNA copy of an mRNA is sufficient to uniquely identify the transcript, (ii) two tags can be ligated together to form ditags and unambiguously amplified by PCR, and (iii) multiple tags can be concatenated for efficient detection by DNA sequencing. The overall reliability of SAGE has been compared to other gene expression profiling methods such as Northern Blots , real-time or kinetic PCR[4, 5], and cDNA and oligo nucleotide micro array hybridizations [6–8]. It was generally found that the reliability and reproducibility of SAGE is high. Typically 70–85% of gene expression changes observed in SAGE can be confirmed by a different method [4, 7]. However, a potential bias introduced by amplification of ditags was discussed already in the original SAGE publication . It was suspected that duplicate ditags, i.e. identical copies of a ditag (AB), would occur only as an artifact of PCR amplification. Therefore, duplicate ditags have been removed prior to tag counting in most SAGE studies so far, partly because of requests from reviewers before publication.
However, duplicate ditags will be encountered naturally with a certain frequency, depending on abundance of the two transcripts from which the ditag is derived [8, 9]. For example, in the original SAGE protocol two blunt ended 14 nucleotide tags were ligated to form ditags. Two tags A and B, each occurring at a frequency of 0.02 have a 0.0004 probability of being joined. Present SAGE studies typically include 50,000 tags (25,000 ditags) leading to 10 AB+BA ditags. However, the total count of a tag of 0.02 frequency in 50,000 is 1000, and the error of 10 introduced by removing the naturally occurring ditags is insignificant. Furthermore, an algorithm to minimize this problem (SAGEparser) was developed by Snyder and coworkers.
However, recent developments in SAGE technology have accentuated the problem of discarding duplicate ditags. First, there has been a drive towards using smaller samples for construction of SAGE libraries, facilitating the analysis of cells with specialized functions such as pancreatic cells obtained from biopsies . Such samples may have extreme gene expression profiles with single transcripts accounting for 5 % of the total population of transcripts. Second, the widespread use of the LongSAGE protocol in which a two base pair overhang is used in the ligation of ditags, instead of blunt ends . Consequently, any LongSAGE tag can only form ditags with tags with a compatible overhang, in principle reducing the number of potential partner tags 16 fold on the average. In this paper we analyze the error introduced by discarding naturally occurring duplicate ditags in LongSAGE and describe a probabilistic algorithm that can distinguish naturally occurring ditags from artifacts.
As can be seen in figure 1, discarding duplicate ditags introduces a serious error for abundant tags in LongSAGE data analysis. For example, two abundant tags, each present 1500 times in a typical 50,000 tag study, would give rise to only 45 duplicate ditags in SAGE, but 720 duplicate ditags in LongSAGE. Counting duplicates only once, as is presently done, would result in an error of 3% for SAGE and 48% for LongSAGE. In this example we have removed the duplicates for AB only, but in reality a particular tag A, will create duplicates with any compatible tag B, C, D etc. at the frequency stipulated in equations 1 or 2. If each of these duplicates is counted only once, a further reduction of the tag count is introduced and, therefore, the error will increase. Indeed, this simple estimate demonstrates that the problem is expected to be much greater for LongSAGE, than for conventional SAGE.
However, the assumption of equal proportions of compatible overhangs in LongSAGE is unrealistic. The genome sequence is not a random distribution of nucleotides  and furthermore, this would limit the maximum tag count of any tag to 1/16 of the total tag count of the library (e.g. 3125 in 50,000), and individual tag counts larger than 3125 have been encountered.
Abundant LongSAGE tags observed in pancreatic acinar cells.
The relationship of tag abundance and degree of change introduced by removal of duplicate ditags.
# of unique tags
Observed change upon removal of duplicate ditags
The corrective measures suggested by Welle and Snyder  are both iterative approaches developed for SAGE. Welle suggests splitting up the dataset in a number of sub-datasets, removing duplicate ditags, and then adding these datasets together allowing duplicates. The number of subdatasets to be created, determining the maximal number of duplicate ditags is a simple guess. In fact, this method is equivalent to setting a maximal allowed ditag count instead of excluding all. In LongSAGE, duplicate ditag counts of several hundreds are frequently observed. Therefore, determining a low meaningful fixed number of duplicate ditags is not feasible. Snyder's algoritm (SAGEparser) includes a proportion of the observed duplicate ditags based on the abundance of the two monotags comprising the ditag. This new tag count can then be used to calculate a new proportion of the observed duplicate ditags to be added. Many iterations of this algorithm would approach the inclusion of all duplicate ditags. While this algorithm includes some of the naturally occurring ditags, it only works for SAGE where all tags can form ditags with each other and does not address whether an entire library is biased or not.
Summary of ditag statistics.
Pancreatic acinar cells
AA or TT
AC or GT
AG or CT
CA or TG
CC or GG
GA or TC
The number of ditags in the 10 possible overlapping classes is tabulated in table 3. The length distribution shows a general overrepresentation of the long form. Apparently, the tag generating restriction enzyme, MmeI, cleaves the DNA strand 21/19 nucleotides downstream of its recognition site twice as often as 20/18 nucleotides. Furthermore, the true distribution among compatible overlapping classes is far from uniform. The overlap CG was only present in 17 ditags, whereas the AA (or TT) was present in 3173 ditags. This is in line with the relative dinucleotide abundance in humans, which also shows a severe under representation of CG, whereas AA (or TT) is more commonly observed . This observation is corroborated with the finding that CpG islands are predominantly found in the first exon of genes and therefore rarely in the 3' end of transcripts mostly represented in SAGE . Analyzing the distribution of overlaps generated in-silico from the human RefSeq v. 16 also shows a similar tendency, albeit to a lesser extent (table 3).
The uneven distribution of compatible overhangs actually raises the question whether a particular overhang can be present in such a low abundance that it suppresses the measurement of other tags due to an insufficient number of compatible monotags. One solution to this would be to conduct the LongSAGE experiments by blunt-ending by T4-DNA polymerase prior to ligation of ditags, thus shortening the LongSAGE tags by two nucleotides . An advantage of this approach is that the error introduced by removal of duplicate ditags is small and similar to those indicated for SAGE (figure 1). However, in the case presented here, saturation of compatible overlaps is not the reason for the rare occurrence of the CG overlap, as CG is a palindrome and thus may form ditags with itself.
Detection of outliers is performed by calculating standardized residuals according to equation (4) for each ditag. Assuming normal distribution of the standardized residuals, the standard deviation is calculated for all ditags observed more than once. An observation is classified as an outlier if the standardized residual is larger than three standard deviations (99% confidence). The standardized residuals are plotted in figure 3b. The low number of standardized residuals that fall outside the confidence interval (outliers) indicates that most duplicate ditags seem to represent true tags.
Comparison of outlier ditags with the matched database sequences.
To investigate whether this is special to this particular dataset, we have performed the analysis on additional datasets, five derived from potato tuber (Høgh, Emmersen and Nielsen, unpublished) and five other libraries derived from pancreatic tissue . The analysis can be carried out on any LongSAGE library, but becomes more precise with more ditags included in the analysis. In our experience, depending on the proportion of duplicate ditags present, a minimum library size of 35,000 tags seems to be the lower limit for a reliable analysis. In the future, exploiting new DNA sequencing technologies, libraries larger than 150,000 tags will probably be common. Varying numbers of duplicate ditags were present in these libraries, and suspicious outlier ditags were identified in all libraries. However, in none of the libraries the duplicate ditags were biased to an extent that justified the bulk removal of all duplicates. Tag extractions including or excluding duplicate ditags of the most abundant transcripts of all additional 10 libraries analyzed is shown in [additional file 3]. The libraries are affected to a different extent depending on the transcription profile, but all libraries show a change of up to 1.5–2 folds among the 20 most abundant transcripts, when including or excluding duplicate ditags [see additional file 3]. However, similar to the pancreatic acinar library, the effect was not restricted to high abundance tags, but was observed in medium abundance tags as well (data not shown). The NCBI database, SAGEdB, currently contains 625 SAGE libraries of which 155 are LongSAGE. Unfortunately, ditag sequences are not reported so re-analysis of existing data will have to be carried out by the submitters which hold the original sequence files.
Additional transcript changes detected between pancreatic acinar and ductal cells by including duplicate ditags.
full-length cDNA clone CS0DC017YH08
Human mitochondrial genes
Integrin beta-6 precursor
Neutrophil gelatinase-associated lipocalin
gastrointestinal glutathione peroxidase 2
Neutral and basic amino acid transport protein
Heparan sulfate 3-O-sulfotransferase-1
Insulin-like growth factor binding protein 7
Krueppel-like factor 5
Connective tissue growth factor
full-length cDNA clone CS0DF028YA19
mitochondrial ATP synthase
It has been found, that the cross-platform agreement of transcriptome measurements by SAGE, LongSAGE and DNA oligonucleotide microarrays was only modest, in contrast to a good intra-platform reproducibility . A majority, but not all of the differentially expressed genes identified by either method can be verified by RT-PCR (e.g.  and ). Therefore, to maximize the accuracy of LongSAGE and hopefully improve the cross-platform consistency, it is important analyze and adjust the tag sequences (including all duplicate ditags) for a number of potential biases prior to mapping to database sequences. First, linker sequences should be removed. Second, the dataset should be tested with the present algorithm. Duplicate ditag counts that falls within the confidence intervals should not be adjusted. But in the case of outliers, these can be manually removed after careful analysis (as in this study), or automatically adjusted to the predicted count by the algorithm to minimize the impact of artifacts. Third, to minimize the number of false positives identified as differentially expressed genes, artifact tags originating from sequence errors should be resolved using SAGEScreen .
The analyses presented here clearly demonstrate that the present procedure of discarding all duplicate ditags can lead to large errors in LongSAGE studies. Instead, the algorithm described here should be used to test LongSAGE datasets and identify potentially biased ditags that should be adjusted or removed. Based on the results obtained it is likely, that most of the transcriptome profiles present in the databases have been artificially biased by the removal of duplicate ditags.
Assuming that the observed tag counts (after amplification and sequence extraction) are representative of the actual distribution of tag molecules, the expected occurrence of a duplicate ditag AB in SAGE can be approximated by
where D is the number of ditags, P the probability, and T the number of monotags observed.
The expected occurrence of a duplicate ditag AB in LongSAGE, assuming even distribution of compatible overlapping classes is then (including duplicate ditags).
The expected occurrence of a duplicate ditag AB in LongSAGE, using dataset specific distributions of compatible overlapping classes can be approximated by
where TPPT is the sum of all possible partner tags.
Standardized residuals was calculated as follows 
A SAGE experiment is performed by digesting cDNA with the frequent cutting restriction enzyme NlaIII, isolating the most 3' fragment and ligating a linker containing the sequence TCCGAC, which is recognized the restriction enzyme MmeI. Tags are generated by MmeI which cleaves the DNA strand 20/18 nt or 21/19 nt downstream of this sequence. Ligated ditags have the general structure CATGXXXXXXXXXXXXXXXX(X)(Y)YYYYYYYYYYYYYYYYCATG, where X denotes tag A and Y denote the reverse complement of tag B. The parentheses indicate that most tags exist in both a short and a long form. Hence, the ditag AB can have the length 40, 41 or 42 nucleotides. Two central base pairs are common to both tag A and tag B and originate from the overlap used during ligation. The Perl script, LongSAGEbias.pl [additional file 4] was developed and used for the data analysis (freely available at ). A schematic representation of the script is shown in figure 2. First, the script extracts both ditags and monotags (21 nt), including possible linker derived tags, from DNA sequence files and the corresponding quality values (*.phd) generated by the Phred base caller . Second, the script then calculates the length of each ditag and uses the 40 nt and 42 nt ditags for the calculation of the distribution of compatible overlapping tags and the propensities that each tag is 21 (short form) or 22 (long form) nucleotides. This information is then used to predict the occurrence of any ditag composed of tags observed in a duplicate ditag by equation 3 according to the tag length consistent with the ditag in question. For 40 or 42 nucleotide ditags, a prediction is made for each of the two monotags constituting the ditag.
In the case of 41 nucleotide ditags, the ditag AB is first analyzed. Since A can exist in a 41 nt ditag both in the long and a short form, two predictions are made and the one closest to the observed is chosen. Then, the ditag BA is considered in an identical manner. The standardized residuals are calculated and the results are written to tabulator separated files easily imported into any spreadsheet for further analysis. Assuming the ditag counts are Poisson distributed, the mean can be estimated as the observed count and the standard deviation as the square root of the observed. The confidence interval of ditag counts can thus be estimated as mean ± 2*standard deviation. For small ditag counts this confidence interval extends below zero. Consequently, the standard deviation of the standardized residuals is calculated from ditags observed four or more times only (4-2*√4 = 0).
The algorithm can be set to include all duplicate ditags, remove all duplicate ditags and adjust the observed ditag counts that fall outside the confidence interval to the prediction value.
In sum, libraries derived from pancreatic acinar cells, ductal cells, and four libraries from different grades of pancreatic intraepithelial neoplasia were analyzed from pancreas. In addition, 5 potato tuber libraries derived from 6 week old minitubers, at harvest, two libraries from 60 days post harvest dormant tubers, and from tuber tissue excised from under an emerging sprout.
LongSAGE tags of 17 nt + CATG were extracted from the human RefSeq v. 16 fasta file and the dinucleotide overlap distributions determined using the PERL script dinuccount.pl .
LongSAGE tags from libraries generated from all potato and pancreatic tissue were extracted using the Perl script sage-phred.pl. For pancreatic acinar and ductal cells the tags were mapped to the Human Gene Index  using sagemap.pl. The two libraries were compared using the Perl script acprob.pl and statistically significant changes using strict Bonferroni correction was recorded including or excluding duplicate ditags. All scripts are available at .
The authors would like to thank Poul Svante Eriksen, Department of Mathematical Sciences, Aalborg University for valuable help with the statistics and Sabine Burkert, Birgit Strzeletzki and Susanne Braun for excellent technical assistance preparing the pancreatic acinar library. This work was supported by the Danish Veterinarian and Agricultural Research Council (23-02-0034) and the Danish Technical Research Council (26-00-0141). A.M.H. and S.A.H. were supported by grants from the Deutsche Krebshilfe (70-2988-Schm3), the Bundesministerium für Bildung und Forschung (0311878) and by EU-Grant BMH4-QLG1-CT-2002-01196.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.