A comparative analysis of the information content in long and short SAGE libraries
BMC Bioinformatics volume 7, Article number: 504 (2006)
Serial Analysis of Gene Expression (SAGE) is a powerful tool to determine gene expression profiles. Two types of SAGE libraries, ShortSAGE and LongSAGE, are classified based on the length of the SAGE tag (10 vs. 17 basepairs). LongSAGE libraries are thought to be more useful than ShortSAGE libraries, but their information content has not been widely compared. To dissect the differences between these two types of libraries, we utilized four libraries (two LongSAGE and two ShortSAGE libraries) generated from the hippocampus of Alzheimer and control samples. In addition, we generated two additional short SAGE libraries, the truncated long SAGE libraries (tSAGE), from LongSAGE libraries by deleting seven 5' basepairs from each LongSAGE tag.
One problem in SAGE studies is that, because tags are short, an individual tag may match multiple different genes. We found that a LongSAGE tag maps to up to 15 UniGene clusters, while ShortSAGE and tSAGE tags map to up to 279 UniGene clusters. Both long and short SAGE libraries exhibit a large number of orphan tags (no gene information in UniGene), implying a limitation of the UniGene database. Among 100 orphan LongSAGE tags, the complete sequences (17 basepairs) of nine orphan tags match 17 genomic sequences; four of the orphan tags match a single genomic sequence. Our data show the potential to resolve 4-9% of orphan LongSAGE tags. Finally, among 400 tSAGE tags showing significant differential expression between Alzheimer disease (AD) and control, 79 tags (19.8%) were derived from multiple non-significant LongSAGE tags, implying that they are false positive results.
Our data show that LongSAGE tags have higher specificity in gene mapping than ShortSAGE tags. LongSAGE tags show an advantage over ShortSAGE in identifying novel genes by BLAST analysis. Most importantly, the chance of obtaining false positive results is higher for ShortSAGE than for LongSAGE libraries because of the lower specificity of short tags in gene mapping. Therefore, we recommend considering the number of UniGene clusters (genes or ESTs) that correspond to a tag when prioritizing significant results.
Serial Analysis of Gene Expression (SAGE), introduced by Velculescu et al., is a powerful open source method for profiling transcripts expressed in a given tissue. In this technique, mRNA transcripts are converted to cDNA and then processed 5' to the poly A+ tail to isolate short cDNA fragments called "tags." These tags are linked together into long concatemers and sequenced. The length of a SAGE tag is either 10 (short SAGE tag) or 17 (long SAGE tag) basepairs (bps) following a known restriction site. SAGE results are recorded as a list of distinct tags whose frequencies can be tabulated to yield a quantitative measure of gene expression. The frequency count of each SAGE tag reflects the abundance of the respective mRNA transcript in the transcriptome of the tissue or cell type under study. Unlike microarray technology, which is limited to a finite number of known gene sequences arrayed on a chip, SAGE detects all transcripts expressed in a tissue sample and provides more quantitative information than microarrays. However, SAGE is expensive, time and labor intensive, and prone to sequencing errors. Therefore, the total number of SAGE libraries produced for a study is generally smaller than for a microarray study.
Annotation of SAGE tags is a major task in SAGE data analysis. Many resources have been developed for mapping SAGE tags to genes, for instance, SAGEmap from the National Center for Biotechnology Information (NCBI) and SAGE Genie from the National Institutes of Health Cancer Genome Anatomy Project. Although these tools are useful, they rely on high quality databases to make confident tag-to-gene mappings. With only 14 bps (10 bps + restriction site) per short SAGE (ShortSAGE) tag, it is impossible to directly screen a tag against the whole genome, since 14 bps are insufficient to identify a unique genomic locus. UniGene Clusters (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene) is the database most frequently used to search for the transcripts (e.g. genes or ESTs) corresponding to a SAGE tag. If a tag cannot be mapped to a UniGene cluster, it is impossible to determine whether the tag is spurious (i.e. mis-sequenced, misincorporation of a nucleotide, not an mRNA) or represents a rare or novel gene not found in the UniGene database. This defeats the purpose of detecting unknown genes using SAGE tags. On the other hand, a LongSAGE tag (21 bps: 17 bps + restriction site) is sufficiently long to be screened directly against the whole genome, with a reasonable chance of identifying its unique locus.
Due to the short length of a SAGE tag, it is common for a tag, especially a ShortSAGE tag, to map to multiple UniGene clusters, which may be genes or ESTs. When multiple genes or ESTs are found for a single tag, it is impossible to apportion the tag count among the genes/ESTs that share the same SAGE tag sequence. Therefore, when such a ShortSAGE tag is found to be differentially expressed between two samples, it cannot be determined which gene(s) or EST(s) are differentially expressed. This can lead to serious problems in interpreting gene expression levels between different tissues or states. The longer tags from LongSAGE libraries may help correct this problem, in addition to providing the opportunity to identify new and unique genes.
Although LongSAGE libraries possess several inherent advantages vis-à-vis ShortSAGE libraries, to date, available studies that compared the information content of ShortSAGE and LongSAGE are limited [2, 5]. In addition, previous studies focused more on the tag annotation issue than other topics. Lu et al. generated four LongSAGE libraries using colon cell lines with/without a p53 mutation under either normal oxygen or hypoxia conditions. Based on these four LongSAGE libraries, they generated four ShortSAGE libraries by extracting the 10-bp tags from the longSAGE tags. They limited their analyses on the confident tags, that is, the tags with counts > 1. They concluded that the ShortSAGE more efficiently identifies differentially expressed genes than LongSAGE. They also found that only 4-7% of the redundant confident ShortSAGE tags can be resolved by confident LongSAGE tags. Similarly, van Ruissen et al.  did not find improvement on SAGE tag annotation by LongSAGE tag. That is, both ShortSAGE and LongSAGE have about 30% of tags with reliable annotation. Overall, these studies seem to favor ShortSAGE libraries.
In this study, we investigated various issues related to the information content of LongSAGE and ShortSAGE libraries. Unlike Lu et al., we utilized two types of ShortSAGE libraries. One is derived from the LongSAGE libraries, as in Lu et al. The other is the real ShortSAGE library sequenced from the samples. We generated four SAGE libraries (two LongSAGE and two ShortSAGE) using human brain tissue samples of two Alzheimer cases and two controls. We attempted to address the following: (1) determine the number of tags that can be matched to UniGene Clusters using LongSAGE and ShortSAGE tags; (2) evaluate tags that we were unable to assign to UniGene Clusters; (3) compare the number of significant differentially expressed genes that can be derived from LongSAGE and ShortSAGE libraries; and (4) investigate the use and potential advantages of LongSAGE tags in identifying novel genes not listed in the UniGene database.
Table 1 summarizes the basic tag information for each SAGE library. More than 70,000 tags were extracted from both LongSAGE and ShortSAGE libraries. The tag count per tag ranges from one to 2,202 for long SAGE tags, and from one to 1,098 for short SAGE tags. Interestingly, the total tag counts and the numbers of distinct tags (unique tags) were higher in AD than in control samples in both LongSAGE and ShortSAGE libraries. For instance, there are 34,475 unique tags in L_AD and 30,581 in L_Ctrl, indicating that more tags are expressed in the AD than in the control tissues. Since not all tags are expressed in both the AD and control libraries, the number of tags expressed in at least one of the libraries increases to 55,093 for the LongSAGE, 43,937 for the tSAGE, and 37,900 for the ShortSAGE compared datasets. Furthermore, the overall frequency of SAGE tags mapped to UniGene build 182 is not very high for any library. For instance, we found 14,643 tags (42.5%) in L_AD and 11,646 tags (38.1%) in L_Ctrl that map to the UniGene database, which leaves a large number of orphan tags (no UniGene IDs) in each library (Table 1).
Applying the same strategy described in Lu et al., we evaluated the tag-to-gene relationship using confident LongSAGE tags, defined as tags with counts > 1. Under this constraint, we still observed more LongSAGE tags in L_AD than in L_Ctrl. Interestingly, we observed similar frequencies of redundant short tags. We found that only about 4.9-5.7% of tSAGE tags mapped to multiple LongSAGE tags (Table 2). Further, more than 70% of confident tags can be mapped to UniGene cluster(s), indicating that the overall low tag-to-gene mapping rate for each library comes mainly from tags with tag counts < 2 (non-confident tags).
As expected, the tag-gene relationship is more specific for the LongSAGE tags than the short SAGE tags. Figure 1 depicts the distribution of tags based on the number of their corresponding UniGene clusters for each compared dataset. The LongSAGE library shows a large percentage of orphan tags (65%) in comparison to tSAGE and ShortSAGE that have about 18% of orphan tags. This is expected, as the probability of mapping to a UniGene Cluster is much smaller for a long SAGE tag due to the extra seven bps. Three compared libraries show a similar percentage of tags mapping to a single UniGene cluster, that is, 32.3% for the LongSAGE, 32.7% for the tSAGE, and 33.1% for the ShortSAGE libraries. However, 97.3% of LongSAGE tags are either orphan tags or map to a single UniGene cluster, while both tSAGE and ShortSAGE libraries still have about 50% of tags mapping to more than one UniGene clusters. The maximum number of UniGene clusters that correspond to a single tag was 15 for the LongSAGE tags, and 279 for both tSAGE and ShortSAGE tags. This may imply that there is a higher chance of obtaining false matches for a ShortSAGE tag than a LongSAGE tag. For instance, of the 17,793 LongSAGE tags that map to a single UniGene cluster, only 5,749 tags map to a single UniGene cluster after converting to the tSAGE tags, and the rest contribute to the pool of tags that map to more than one cluster which may represent false matches. As theorized, the increased specificity in gene mapping offered by the LongSAGE tags is substantial, compared to ShortSAGE tags.
When we compared the expression pattern between AD and control for three types of libraries: LongSAGE, tSAGE, and ShortSAGE, both LongSAGE and tSAGE libraries share strong similarity (Figure 2). This is reasonable as they were based on the same samples. Unexpectedly, S_AD and S_Ctrl show very similar expression levels for the majority of genes, which is different from the case and control samples used for LongSAGE and tSAGE libraries. Our testing results reflected the expression patterns in Figure 2. We detected 380 LongSAGE tags, 400 tSAGE tags, and 156 ShortSAGE tags with significant differential expression between AD and control (P< 0.05). Clearly, we detected fewer tags in the ShortSAGE dataset than the other two. Although significant, this difference could be due to gene expression variation between samples with the same disease status.
Since both LongSAGE and tSAGE libraries were derived from the same samples, we used these two datasets to measure the relative ability of long and short SAGE libraries to detect altered gene expression. We found that the 400 significant differentially expressed tSAGE tags were derived from 336 significant and 1,425 non-significant LongSAGE tags. We assigned each tSAGE tag to one of three categories that are defined based on the testing results of its corresponding long tags: (1) Positive group, if all corresponding LongSAGE tags for the tSAGE tag are significant; (2) Negative group, if all corresponding LongSAGE tags for the tSAGE tag are not significant; or (3) Either group, if the corresponding LongSAGE tags for the tSAGE tag are a combination of significant and non-significant. Figure 3 depicts the relationship between the 400 significant tSAGE tags and their corresponding LongSAGE tags in these three groups. The 400 tSAGE tags distributed as 156 tSAGE tags in the Positive group, 79 in Negative group, and 165 in the Either group. Interestingly, each tSAGE tag in the Positive group was derived from a single LongSAGE tag, but the tag in both Negative and Either groups was derived from at least two LongSAGE tags. The maximum number of corresponding LongSAGE tags for a tSAGE tag was 114 for the Negative group and 68 for the Either group. We also examined the number of UniGene clusters that mapped to each of the 400 significant tSAGE tags. The tSAGE tags in the Positive group mapped up to seven genes, while the tags in the Negative group and Either group mapped up to 108 and 66 genes, respectively. Overall, the significant tSAGE tags in both Negative and Either groups tend to map to more LongSAGE tags and known genes.
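The three-group assignment described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study; the function name `classify_short_tag` and the boolean input format are assumptions made for the example.

```python
def classify_short_tag(long_tag_significance):
    """Assign a tSAGE tag to the Positive, Negative, or Either group.

    long_tag_significance is an iterable of booleans, one per corresponding
    LongSAGE tag, where True means that long tag tested significant.
    (Hypothetical input format chosen for this sketch.)
    """
    flags = list(long_tag_significance)
    if not flags:
        raise ValueError("a tSAGE tag must derive from at least one LongSAGE tag")
    if all(flags):
        return "Positive"   # every corresponding long tag is significant
    if not any(flags):
        return "Negative"   # no corresponding long tag is significant
    return "Either"         # mixture of significant and non-significant


# Toy inputs, not data from the study.
groups = [classify_short_tag(f) for f in ([True], [False, False], [True, False])]
```

A tSAGE tag derived from a single significant LongSAGE tag falls in the Positive group, which mirrors the observation that every Positive tag in this dataset had exactly one corresponding long tag.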
One of the most interesting findings is the analysis of orphan tags. The BLAST (http://www.ncbi.nlm.nih.gov/BLAST/) analysis of the 100 randomly selected orphan tags revealed 17 orphan tags in which at least 17 bps completely matched a human gene sequence. This frequency (17%) is close to the probability (14%) of obtaining one gene sequence perfectly matched to 17 bps of a given tag, assuming a human genome size of 2.864×10^9 bps and equal frequency of each nucleotide at each base. The number of matched gene sequences for an orphan tag increases as the number of matched bps decreases (Table 3). A total of 39 gene sequences were identified through this approach. Since the tag sequence used in the BLAST analysis consists of four bps (nucleotide positions one to four) from the restriction site and 17 bps (nucleotide positions five to 21) from the SAGE tag, we also restricted our selection to tags in which all 17 bps of the tag region match a gene sequence. The reason for this is that sequencing errors are more likely in the restriction site than in the tag region. Under these criteria, the ending position of the matched segment in the tag sequence is always 21 and the starting position must be less than or equal to five. We found nine orphan tags that met these criteria (Table 4). Four of the nine orphan tags matched a single human gene sequence (with 21, 20, or 18 matched bps); these sequences are more likely to be the real transcripts for these four orphan tags.
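The positional criterion for a match that covers the whole tag region can be written as a simple check. The sketch below assumes 1-based, inclusive coordinates on the 21-bp query (positions one to four for the restriction site, five to 21 for the tag), as in the description above; the function and constant names are hypothetical.

```python
SITE_LENGTH = 4    # restriction-site bases at positions 1-4 of the query
FULL_LENGTH = 21   # site plus 17-bp tag


def covers_tag_region(match_start, match_end):
    """Return True if an ungapped match spans all 17 tag bases.

    Coordinates are 1-based and inclusive on the 21-bp query sequence
    (an assumption of this sketch).
    """
    return match_end == FULL_LENGTH and match_start <= SITE_LENGTH + 1


def matched_length(match_start, match_end):
    """Number of query bases covered by the match."""
    return match_end - match_start + 1
```

For example, a match from position 3 to 19 covers 17 bps but fails the check because it ends before position 21, whereas a match from position 5 to 21 passes with exactly 17 bps.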
The use of SAGE libraries has been advocated, but technical complexity has limited their use. In addition, the value of long vs. short tag SAGE has not been widely explored. A few facts for a SAGE study are listed below. First, the tSAGE libraries share similar numbers of unique tags and tag counts with the "real" ShortSAGE libraries. The small differences between tSAGE and ShortSAGE libraries may be simply due to the variation between samples. These outcomes imply one advantage for the LongSAGE libraries as they can be analyzed in two ways (as long or short tags). Second, to reach a similar number of total tag counts, LongSAGE libraries, due to greater tag length, need to sequence more clones than ShortSAGE libraries, resulting in increased time and cost. Third, a large number of orphan tags exist in both LongSAGE and ShortSAGE libraries. In fact, LongSAGE libraries have more orphan tags than ShortSAGE libraries - due to their greater specificity in gene mapping.
Identifying differentially expressed genes between tissue samples is often the goal in conducting expression studies. Conclusions on what constitutes a significant change in gene expression are usually guided by the p-values derived from statistical tests. One important feature of our study is our investigation on the potential and serious problem of identifying wrong genes using SAGE libraries, especially ShortSAGE libraries. Are genes or ESTs corresponding to a significant differentially expressed tag real? By utilizing both LongSAGE and tSAGE libraries, we showed that likely only 156 out of 400 significant tSAGE tags (39%) are the presumed true significant tags, because they were derived from significant LongSAGE tags. On the other hand, the 79 significant tSAGE tags in the Negative group are probably not truly differentially expressed, because none of their corresponding LongSAGE tags are significant. Since the tag count for a tSAGE tag is the sum of tag counts from its corresponding LongSAGE tags, a false positive result of a tSAGE tag may simply be due to its mapping to multiple LongSAGE tags. In a real setting, this problem will exist for a tag that maps to multiple genes or ESTs. When there are only ShortSAGE data available, we will not be able to dissect the tag-gene relationship as described here. We may make a wrong decision by concluding a significant short SAGE tag by simply looking at the p-value, even if the p-values are very small.
Since all 156 tSAGE tags in the Positive group (the presumed true significant tSAGE tags) map to a single LongSAGE tag that has high specificity in tag-to-gene mapping, one potential solution is to take into account the number of UniGene clusters mapped to a tag in the decision making process. Among the 156 tSAGE tags in the Positive group (the presumed true significant ones), 67% of tags match to two UniGene clusters. On the other hand, 53% of tSAGE tags in the Negative group (the false ones) mapped to more than two UniGene clusters. If we treat the tags that map to two or fewer UniGene clusters as the presumed true significant tags, we will only include 47% of false ones, which is better than including all tags with false positive results.
Throughout this paper, our tag-to-gene mapping analysis relies on the UniGene database. However, a UniGene cluster does not always correspond to a distinct gene; multiple UniGene clusters may refer to the same gene. In our LongSAGE tag analysis, we found that 97.3% of LongSAGE tags are either orphan tags or map to a single UniGene cluster, which is less likely to produce ambiguity in tag-to-gene mapping. The remaining 2.7% of LongSAGE tags include 1.9% (1,044 tags) that map to two UniGene clusters. While dissecting the properties of each UniGene cluster is not our main focus in this paper, we found that 10.7% of these 1,044 LongSAGE tags have the same description for the two clusters even though their UniGene IDs are different. Therefore, it is possible that some of these LongSAGE tags in fact map to a single gene, which may increase the specificity of tag-to-gene mapping for LongSAGE tags.
The large number of orphan tags also reflects the limitation of the UniGene database. We showed that long SAGE tags can potentially be used to identify novel genes that are not listed in the UniGene database. Unlike the short SAGE tag, the long SAGE tag has enough nucleotides to allow a BLAST search for novel genes. In this study, our criterion in the BLAST analysis was that at least 17 bps of a SAGE tag match a human gene sequence without any gaps. Under this search, we were able to identify 39 genes for 17 orphan tags. More specifically, nine orphan tags had the full 17 bps of the tag region matching a human gene sequence, which reduced the number of genes identified to 17. The best results in our BLAST search are the four orphan tags that matched a single gene sequence by 21, 20, or 18 bps. Considering that the probability of obtaining one matched gene sequence is as low as 0.07% for 21 bps, 0.3% for 20 bps, and 4% for 18 bps for a genome size of 2.864×10^9 bps, it is highly likely that these are real genes corresponding to these four orphan tags. From this BLAST study, we should be able to resolve 4–9% of orphan tags. Although we surveyed only 100 orphan tags, these results are encouraging because they suggest that the number of known genes can be expanded using LongSAGE libraries.
Although our SAGE libraries cannot represent all SAGE studies, they provide a good example of how one can filter significant tags based on the number of their corresponding genes. In general, it would seem reasonable to use, at most, two corresponding genes as a cutoff to filter significant ShortSAGE tags. Further, if a project aims to be more stringent in gene selection, one can take the most conservative approach and exclude all significant tags that map to more than one gene.
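As an illustration of this filtering rule, the following Python sketch keeps only significant tags whose number of corresponding UniGene clusters does not exceed a cutoff. The function name, the dictionary layout, and the choice to drop orphan tags (zero clusters) are assumptions of this example; the tag and cluster labels are placeholders, not real identifiers.

```python
def filter_by_cluster_count(significant_tags, clusters_per_tag, max_clusters=2):
    """Keep significant tags that map to between 1 and max_clusters clusters.

    clusters_per_tag maps a tag to the list of UniGene clusters it matches.
    Orphan tags (no clusters) are dropped here, which is an assumption of
    this sketch rather than a rule stated in the text.
    """
    kept = []
    for tag in significant_tags:
        n_clusters = len(clusters_per_tag.get(tag, ()))
        if 1 <= n_clusters <= max_clusters:
            kept.append(tag)
    return kept


# Placeholder tags and cluster labels, not data from the study.
toy_clusters = {
    "TAG_A": ["cluster_1"],
    "TAG_B": ["cluster_1", "cluster_2", "cluster_3"],
    "TAG_C": [],
}
default_kept = filter_by_cluster_count(["TAG_A", "TAG_B", "TAG_C"], toy_clusters)
strict_kept = filter_by_cluster_count(["TAG_A", "TAG_B"], toy_clusters, max_clusters=1)
```

Setting `max_clusters=1` corresponds to the most conservative approach, in which every significant tag that maps to more than one gene is excluded.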
The LongSAGE exhibits advantages over ShortSAGE libraries in several aspects. LongSAGE tags appear to have higher specificity in gene mapping than ShortSAGE tags. LongSAGE tags show an advantage over ShortSAGE in identifying novel genes by BLAST analysis, which may help to reduce the number of orphan tags. Most importantly, LongSAGE libraries have advantages in identifying genes that are truly expressed differently between samples, compared to ShortSAGE libraries. In addition we will still be able to perform analysis based on ShortSAGE tags using LongSAGE libraries. This makes the extra costs and experimental time that a LongSAGE library needs worthwhile.
Human brain samples and pathological assessment
Human brain tissues were collected in the Kathleen Price Bryan Brain Bank at the Duke University Alzheimer Disease Research Center (ADRC) and in the Brain Bank of the Center for Human Genetics (CHG) at Duke University Medical Center (DUMC), following the rapid autopsy protocol . The hippocampus was dissected at the time of autopsy, and matching 100-200 mg portions of CA 1–4 were removed and used for RNA isolation and expression studies. Four brain tissue samples, including two AD (Sample IDs: 470 and 589) and two controls (sample IDs: 673 and 707), used in this study were previously described in Xu et al . All four samples have the same apolipoprotein E 3/3 genotype (APOE3/3). The pathological diagnosis of AD was established according to CERAD criteria , and the degree of AD pathological changes was staged according to Braak . The AD patients used in this study have pathological changes at the Braak and Braak stage IV and V (B&B stage IV and V), and the control was cognitively and pathologically normal with B&B stage I. Post-mortem delay times ranged from 1:10 to 4:15 hours .
RNA isolation for SAGE library construction
Total RNA was isolated from frozen hippocampus samples of AD patients and controls using TRIzol reagent (Invitrogen) according to the manufacturer's instructions. Briefly, brain tissue was homogenized in TRIzol reagent by Dounce homogenization and the homogenized samples were incubated for five minutes at room temperature. After the addition of chloroform, the mixture was centrifuged to separate the RNA containing aqueous phase from the TRIzol reagent. The aqueous phase was transferred to a fresh tube and the RNA precipitated after adding 0.5 volume of isopropyl alcohol. The RNA pellet was washed once with 75% ethanol, dried, and resuspended in DEPC treated water and stored at -80 °C.
Construction of human hippocampus SAGE libraries
For ShortSAGE library construction, standard protocols as described by Velculescu et al , and Basrai and Hieter  were used with minor modifications. Briefly, SAGE was performed with 10 µg total RNA isolated from human brain hippocampus samples as outlined above. The cDNA was prepared using the SuperscriptII cDNA synthesis kit (Invitrogen) with gel-purified 5'-biotinylated Oligo(dT)18 (Integrated DNA Technologies, Coralville, IA), according to the manufacturer's protocol. Nla III and Bsm FI restriction enzymes (New England Biolab, Beverly, MA) were used for tag generation. Bsm FI digestion was performed at 37°C for 2.5 h (instead of 65 °C) using 40 units Bsm FI in a 300 µl reaction volume with supplied buffer. After a three-hour concatemerization step, the concatemers were heated at 65 °C for 10 minutes, followed by two minutes on ice to enhance cloning efficiency. Purified concatemers were subsequently cloned in the Sph I site of pZero-1 (Invitrogen) and transformed in competent ElectroMax DH10B cells (Invitrogen) using a 0.1 cm cuvette and the Gene Pulser II (BioRad). Individual SAGE library clones were selected and PCR amplified using 96-well format Qiagen Real minipreps, and sequenced with ABI 3700 capillary sequencer using BigDye chemistry.
LongSAGE library construction was performed with 10 µg total RNA using the standard SAGE protocol with modifications according to Saha et al. We used the MmeI type IIS restriction endonuclease (New England Biolab) to release the linker tag molecules from the cDNA.
SAGE Tag Extraction
ShortSAGE tags (10 bps) were extracted from the PHD files with eSAGE software, using a threshold value of PHRED 20 for each base (Margulies and Innis 2000). The SAGE tags were compared between the ShortSAGE AD (S_AD) and ShortSAGE control (S_Ctrl) library using eSAGE software to form a compared ShortSAGE database. LongSAGE tags (17 bps) were extracted from raw sequence data of LongSAGE libraries using SAGE2000 version 4.5 Analysis Software. We directly merged the SAGE tags from the LongSAGE AD (L_AD) and LongSAGE control (L_Ctrl) libraries to generate a compared LongSAGE database. Both compared ShortSAGE and LongSAGE databases were mapped to UniGene build 182 (National Center for Biotechnology Information, NCBI).
SAGE Data Analysis
In addition to the four SAGE libraries described above, we used the same strategy employed by Lu et al.  to generate two additional short SAGE libraries based on the LongSAGE libraries. We truncated the seven 5' bps of each long SAGE tag to generate truncated LongSAGE (tSAGE) library, which is analogous to the ShortSAGE library - as each tSAGE tag has only 10 bps. The tag count of a tSAGE tag is the sum of tag counts of LongSAGE tags that have the same first 10 bps. Hereafter, we refer to the two tSAGE libraries as T_AD for the tSAGE AD library and T_Ctrl for the tSAGE control library. Similarly, we generated and compared a SAGE database for T_AD and T_Ctrl, and mapped tSAGE tags to UniGene build 182. This allows us to directly compare results for long and short SAGE (i.e. LongSAGE and tSAGE) tags derived from the same tissue samples. We utilized these six libraries (three compared SAGE databases) to investigate the information content of long and short SAGE libraries.
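The construction of a tSAGE library can be sketched as follows. The example follows the statement that LongSAGE tags sharing the same first 10 bps are pooled and their counts summed; the function name and the dictionary representation of a library are assumptions of this sketch, and the toy tags are not data from the study.

```python
from collections import Counter

SHORT_TAG_LENGTH = 10  # bps kept per tSAGE tag, as stated in the text


def truncate_library(long_tag_counts, keep=SHORT_TAG_LENGTH):
    """Collapse LongSAGE tag counts into tSAGE tag counts.

    long_tag_counts maps a 17-bp tag sequence to its tag count. Tags that
    share the same first `keep` bases are pooled and their counts summed.
    """
    short_counts = Counter()
    for tag, count in long_tag_counts.items():
        short_counts[tag[:keep]] += count
    return short_counts


# Hypothetical toy library with three 17-bp tags.
toy_long = {
    "AAAAAAAAAACCCCCCC": 5,
    "AAAAAAAAAAGGGGGGG": 2,
    "TTTTTTTTTTACGTACG": 1,
}
toy_short = truncate_library(toy_long)
```

In the toy library, the two long tags that begin with the same ten bases collapse into one tSAGE tag with a count of 7, which is how a tSAGE count can mix signal from several distinct transcripts.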
First, the data were summarized for these six SAGE libraries. We computed the number of unique tags, the total tag counts, the number of tags that map to UniGene, and the number of tags with no UniGene information (i.e. the orphan tags) for each library. We also evaluated the specificity of the long and short SAGE tags for gene mapping by computing the number of genes corresponding to each tag for the three compared SAGE datasets. To estimate the percentage of redundant short SAGE tags that can be resolved by the long SAGE tags, we followed the approach of Lu et al. using the LongSAGE and tSAGE libraries. We obtained a set of unique LongSAGE tags with tag counts greater than one. Then, we computed the numbers of unique and redundant tSAGE tags that correspond to these LongSAGE tags. In other words, these redundant tSAGE tags can be resolved if their corresponding LongSAGE tags are known. Further, we investigated the tag-to-gene mapping pattern of the tSAGE tags that originally map to a single UniGene cluster under the LongSAGE tag format.
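The count of unique and redundant tSAGE tags among confident LongSAGE tags might be computed along the following lines. This is a sketch under stated assumptions: a library is represented as a dictionary from tag sequence to count, a short tag is the first 10 bps of a long tag, and the function name is hypothetical.

```python
from collections import defaultdict


def redundant_short_tags(long_tag_counts, min_count=2, keep=10):
    """Return (unique, redundant) short tags among confident long tags.

    A long tag is 'confident' when its count is at least min_count
    (i.e. count > 1). A short tag is 'redundant' when two or more
    confident long tags share its first `keep` bases.
    """
    groups = defaultdict(set)
    for tag, count in long_tag_counts.items():
        if count >= min_count:
            groups[tag[:keep]].add(tag)
    unique = {short for short, longs in groups.items() if len(longs) == 1}
    redundant = {short for short, longs in groups.items() if len(longs) > 1}
    return unique, redundant


# Hypothetical toy library; the count-1 tag is ignored as non-confident.
toy_long = {
    "AAAAAAAAAACCCCCCC": 5,
    "AAAAAAAAAAGGGGGGG": 2,
    "AAAAAAAAAATTTTTTT": 1,
    "CCCCCCCCCCACGTACG": 3,
}
unique_tags, redundant_tags = redundant_short_tags(toy_long)
redundant_fraction = len(redundant_tags) / (len(unique_tags) + len(redundant_tags))
```

The fraction of redundant short tags computed this way is the kind of quantity reported as a percentage of tSAGE tags that map to multiple LongSAGE tags.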
Second, we examined the performance of the LongSAGE, tSAGE, and ShortSAGE libraries in identifying differentially expressed genes. Chi-square and Fisher exact tests, as previously described , were used to test differences in expression levels between AD and control for each tag in each compared SAGE dataset. Since it is not our goal to provide a set of candidate genes, but rather use the results to compare the relationship between significant short and long SAGE tags, we applied a nominal significance level of 0.05 to declare significant tags without considering a correction for multiple testing. We summarized the number of significant tags for each compared SAGE dataset. For all significant tSAGE tags, we investigated the number of its corresponding long tags. We compared the LongSAGE tag counts per tSAGE tag among three groups.
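For readers who want to reproduce a per-tag test, a two-sided Fisher exact test can be sketched with the Python standard library alone. The 2×2 table used here (this tag versus all other tags, in library A versus library B) is an assumption of the sketch, since the exact table layout is described in the earlier work cited above; the function names are hypothetical.

```python
import math


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(count_a, total_a, count_b, total_b):
    """Two-sided Fisher exact p-value for one tag in two libraries.

    count_a, count_b: counts of the tag in libraries A and B.
    total_a, total_b: total tag counts of libraries A and B.
    """
    tag_total = count_a + count_b
    log_denominator = _log_comb(total_a + total_b, tag_total)

    def log_prob(x):
        # Hypergeometric probability that library A holds x of the tag's counts.
        return (_log_comb(total_a, x)
                + _log_comb(total_b, tag_total - x)
                - log_denominator)

    lowest = max(0, tag_total - total_b)
    highest = min(tag_total, total_a)
    observed = log_prob(count_a)
    tolerance = 1e-7  # guards against floating-point ties
    p_value = 0.0
    for x in range(lowest, highest + 1):
        lp = log_prob(x)
        if lp <= observed + tolerance:
            p_value += math.exp(lp)  # sum outcomes no more likely than observed
    return min(1.0, p_value)


# Small textbook-style table, not data from the study.
p_small = fisher_exact_two_sided(3, 4, 1, 4)
```

A tag would be declared significant at the nominal level when the returned p-value is below 0.05, with no correction for multiple testing, as described above.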
Finally, UniGene serves as a database to interpret the SAGE tags. Each UniGene cluster contains sequences that represent a unique gene or EST. Since the UniGene set is based on expressed mRNAs, it represents only a small portion of the genome. Although there are more than 53,000 unique UniGene entries, a large number of orphan tags are still found in both ShortSAGE and LongSAGE libraries. Here, we investigate whether LongSAGE tags can help us identify genes corresponding to these orphan tags and whether they represent real genes or are artifacts of library construction and analysis. Since a LongSAGE tag is up to 21 bps long (including the cut site), it is possible to search for genes corresponding to these long tags using sequence alignment tools such as BLAST, which finds regions of local similarity between DNA sequences. Under the assumption of equal probability of sampling a nucleotide at each base, the probability of obtaining an exact matched sequence with k bps is $(1/4)^k$. Assuming that the human genome consists of N bps, the approximate probability of obtaining one matched chromosomal segment with k bps is
$$(N-k+1)\left(\tfrac{1}{4}\right)^{k}\left[1-\left(\tfrac{1}{4}\right)^{k}\right]^{N-k}$$
if all chromosomal segments of k bps are independent, and the expected number of chromosomal segments that match a tag with k bps is $(N-k+1)(1/4)^k$. The number of genes matching a given tag decreases as the number of required matched bps (k) increases. If we assume that the human genome consists of 2.864×10^9 bps (Golden path length at http://www.ensembl.org/Homo_sapiens/index.html), we may expect to find 10 sequence segments matched to a 14-bp tag sequence. This number falls below one when we require 16 or more bps of a tag to match. Clearly, a larger k gives higher accuracy in gene identification than a smaller k. Based on the above calculations, we used 17+ bps as our search criterion in the BLAST analysis. However, this computation does not take into account genes that may be highly homologous to each other. Here, we examine the frequencies of obtaining perfectly matched gene sequences for orphan tags through BLAST analysis. A gene sequence is considered a perfect match with an orphan tag if it has a segment matching a contiguous portion of the tag, that is, no gaps (unmatched nucleotides) within the matched sequence are allowed. We randomly selected 100 orphan LongSAGE tags from the L_Ctrl library and screened the 21-bp LongSAGE tag sequences by BLAST. We selected the tags that show a perfect match to human genes with at least 17 bps.
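The calculations behind these figures can be checked numerically. The sketch below assumes that the probability of exactly one match is the binomial probability n·p·(1-p)^(n-1) with n = N-k+1 independent segments and p = (1/4)^k; this assumption reproduces the percentages quoted in the text (about 14% for 17 bps and about 4% for 18 bps). The function and constant names are hypothetical.

```python
import math

GENOME_SIZE = 2_864_000_000  # bps, the genome size assumed in the text


def expected_matches(k, genome_size=GENOME_SIZE):
    """Expected number of k-bp genomic segments matching a given k-bp tag."""
    return (genome_size - k + 1) * 0.25 ** k


def prob_exactly_one_match(k, genome_size=GENOME_SIZE):
    """Probability that exactly one k-bp genomic segment matches the tag.

    Binomial model with n independent segments, each matching with
    probability p = (1/4)**k (an assumption of this sketch).
    """
    n = genome_size - k + 1
    p = 0.25 ** k
    # log1p keeps precision when p is tiny and n is in the billions.
    return n * p * math.exp((n - 1) * math.log1p(-p))


for k in (14, 16, 17, 18, 20, 21):
    print(k, round(expected_matches(k), 4), round(prob_exactly_one_match(k), 5))
```

Under these assumptions the expected number of matches is between 10 and 11 for k = 14 and drops below one at k = 16, which is the reasoning behind requiring at least 17 matched bps.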
Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression. Science 1995, 270: 484–487. 10.1126/science.270.5235.484
van Ruissen F, Ruijter JM, Schaaf GJ, Asgharnegad L, Zwijnenburg DA, Kool M, Baas F: Evaluation of the similarity of gene expression data estimated with SAGE and Affymetrix GeneChips. BMC Genomics 2005, 6: 91. 10.1186/1471-2164-6-91
Lash AE, Tolstoshev CM, Wagner L, Schuler GD, Strausberg RL, Riggins GJ, Altschul SF: SAGEmap: a public gene expression resource. Genome Res 2000, 10: 1051–1060. 10.1101/gr.10.7.1051
Boon K, Osorio EC, Greenhut SF, Schaefer CF, Shoemaker J, Polyak K, Morin PJ, Buetow KH, Strausberg RL, De Souza SJ, Riggins GJ: An anatomy of normal and malignant gene expression. Proc Natl Acad Sci USA 2002, 99: 11287–11292. 10.1073/pnas.152324199
Lu J, Lal A, Merriman B, Nelson S, Riggins G: A comparison of gene expression profiles produced by SAGE, long SAGE, and oligonucleotide chips. Genomics 2004, 84: 631–636. 10.1016/j.ygeno.2004.06.014
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389
Hulette CM, Welsh-Bohmer KA, Crain B, Szymanski MH, Sinclaire NO, Roses AD: Rapid brain autopsy. The Joseph and Kathleen Bryan Alzheimer's Disease Research Center experience. Arch Pathol Lab Med 1997, 121: 615–618.
Xu PT, Li YJ, Qin XJ, Scherzer CR, Xu H, Schmechel DE, Hulette CM, Evin J, Gullans SR, Haines J, Pericak-Vance MA, Gilbert JR: Differences in apolipoprotein E3/3 and E4/4 allele-specific gene expression in hippocampus in Alzheimer disease. Neurobiol Dis 2006, 21: 256–275. 10.1016/j.nbd.2005.07.004
Mirra SS, Heyman A, McKeel D, Sumi SM, Crain BJ, Brownlee LM, Vogel FS, Hughes JR, van Belle G, Berg L: The Consortium to Establish a Registry for Alzheimer's Disease (CERAD). Part II. Standardization of the neuropathologic assessment of Alzheimer's disease. Neurology 1991, 41: 479–486.
Braak H, Braak E: Neuropathological stageing of Alzheimer-related changes. Acta Neuropathol (Berl) 1991, 82: 239–259. 10.1007/BF00308809
Basrai MA, Hieter P: Transcriptome analysis of Saccharomyces cerevisiae using serial analysis of gene expression. Methods Enzymol 2002, 350: 414–444.
Saha S, Sparks AB, Rago C, Akmaev V, Wang CJ, Vogelstein B, Kinzler KW, Velculescu VE: Using the transcriptome to annotate the genome. Nat Biotechnol 2002, 20: 508–512. 10.1038/nbt0502-508
Hauser MA, Li YJ, Takeuchi S, Walters R, Noureddine M, Maready M, Darden T, Hulette C, Martin E, Hauser E, Xu H, Schmechel D, Stenger JE, Dietrich F, Vance J: Genomic convergence: identifying candidate genes for Parkinson's disease by combining serial analysis of gene expression and genetic linkage. Hum Mol Genet 2003, 12: 671–677. 10.1093/hmg/12.6.671
We thank the Alzheimer disease patients and individuals who participated in the autopsy program at Duke University, and the support of their family members. We also thank the clinical and research personnel of the CHG at the DUMC and the Joseph and the Kathleen Bryan ADRC. This work was supported by a Zenith Award (ZEN -01-2935) from the Alzheimer's Association; the 2001 Louis D. award from the Institut de France, as well as grants AG019757, AG005128, and AG021547 from the National Institutes of Health.
YJL supervised the statistical analysis, drafted and revised the manuscript, and is responsible for the content of the paper. PX generated the SAGE libraries used in this study and helped with manuscript preparation. XQ performed the data analysis. DES was involved in patient ascertainment. CMH was involved in the autopsy work. JLH and MAP are the PIs of the Alzheimer studies and grants that funded part of the research, and they made samples available for this study. JRG supervised the molecular biology components and helped with manuscript preparation.
Yi-Ju Li, Puting Xu, Xuejun Qin, Donald E Schmechel, Christine M Hulette, Jonathan L Haines, Margaret A Pericak-Vance and John R Gilbert contributed equally to this work.
Cite this article
Li, YJ., Xu, P., Qin, X. et al. A comparative analysis of the information content in long and short SAGE libraries. BMC Bioinformatics 7, 504 (2006). https://doi.org/10.1186/1471-2105-7-504