PET-Tool: a software suite for comprehensive processing and managing of Paired-End diTag (PET) sequence data
© Chiu et al; licensee BioMed Central Ltd. 2006
Received: 27 June 2006
Accepted: 25 August 2006
Published: 25 August 2006
We recently developed the Paired-End diTag (PET) strategy for efficient characterization of mammalian transcriptomes and genomes. The paired-end nature of short PET sequences derived from long DNA fragments raises a new set of bioinformatics challenges, including how to extract PETs from raw sequence reads and how to map PETs to reference genome sequences correctly yet efficiently. To accommodate and streamline analysis of the large volume of PET sequences generated from each PET experiment, an automated PET data processing pipeline is desirable.
We designed an integrated software package, PET-Tool, to automatically process PET sequences and map them to genome sequences. The Tool was implemented as a web-based application composed of four modules: the Extractor module for PET extraction; the Examiner module for analytic evaluation of PET sequence quality; the Mapper module for locating PET sequences in the genome sequences; and the ProjectManager module for data organization. The performance of PET-Tool was evaluated through the analysis of 2.7 million PET sequences. PET-Tool proved accurate and efficient in extracting PET sequences and removing artifacts from large-volume datasets. Using optimized mapping criteria, over 70% of quality PET sequences were mapped specifically to the genome sequences. On a 2.4 GHz LINUX machine, it takes approximately six hours to process one million PETs from extraction to mapping.
The speed, accuracy, and comprehensiveness of PET-Tool show that it is an important and useful component in PET experiments, and that it can be extended to accommodate other related analyses of paired-end sequences. The Tool also provides user-friendly functions for data quality checks and a system for multi-layer data management.
Tag-based sequencing strategies such as Serial Analysis of Gene Expression (SAGE) are efficient for analyzing DNA fragments in transcriptome characterization and genome annotation studies [1–3]. However, the information content in each SAGE tag based on an anchored restriction enzyme recognition site within the DNA segment is limited, and the mapping of SAGE tags to genome sequences for transcript identification can be ambiguous. Despite the recent improvements in tagging 5' terminal signatures of cDNA [4, 5] to determine transcription start sites (TSS), the most significant advance in this field is the simultaneous tagging of 5' and 3' terminal signatures of DNA fragments subjected to study. In this effort, we first developed an intermediate approach that precisely extracts separate 5' and 3' terminal tags from cDNA fragments for sequencing. With this new capability, we proceeded to design and develop a cloning strategy, called Gene Identification Signature (GIS) analysis, which covalently links the 5' and 3' signatures of each full-length transcript into a Paired-End diTag (PET) structure. In a GIS-PET experiment, most of the PETs are 36 bp in length (18 bp for the 5' signature tag and 18 bp for the 3' signature tag), and multiple PETs can be concatenated together to form longer stretches of DNA fragments for efficient high-throughput sequencing. An average sequencing read (700–800 bp) of a GIS-PET library clone can reveal 10–15 PET units, which is equivalent to 30 conventional cDNA sequencing reads for 15 cDNA clones analyzed from both ends. The PET sequences can then be accurately mapped to the reference genome sequences and precisely demarcate the boundaries of transcription units in the genome landscape.
With this combined efficiency and accuracy of GIS-PET, a mammalian transcriptome can be thoroughly analyzed using hundreds of thousands of high-quality transcript sequences with a modest sequencing effort, as further demonstrated in the comprehensive characterization of mouse transcriptomes. The PET-based DNA analysis strategy has also been applied to characterize genomic DNA fragments generated by chromatin immunoprecipitation (ChIP) enriched for specific binding targets of given DNA-binding proteins, and whole-genome ChIP-PET data have provided global maps of transcription factor binding sites for p53 in the human genome and for Oct4 and Nanog in the mouse genome. PET-based DNA analyses (GIS-PET and ChIP-PET) promise to play a significant role in the post-genome efforts to identify all functional elements in the human genome, and there is no inherent limit to applying the PET-based approach to other DNA analyses, such as analyses of epigenetic elements.
To fully realize the potential of PET-based sequencing analyses, we need sophisticated informatics capabilities to manage the large volume of PET sequences generated from each GIS-PET and ChIP-PET experiment. The new bioinformatics challenges include how to accurately identify and extract PET sequences embedded in raw sequence reads; how to specifically and efficiently map the paired 5' and 3' signatures of PET sequences to complex genomes such as the human and mouse genomes; and how to manage the immense amount of data generated from GIS-PET and ChIP-PET experiments in a user-friendly way for effective data mining and analysis. Because of the paired-end nature of PET sequences, these issues are far more complicated than those related to SAGE-like mono-tags and therefore cannot be handled by software packages previously developed for SAGE analysis [12–15].
To accommodate and process PET sequence data, we developed a software suite called PET-Tool that provides a complete workflow, from extracting PET sequences from raw sequencing reads to mapping the PET sequences to the reference genomes. In this study, we describe the architecture, implementation details, utility, and robustness of PET-Tool by analyzing four datasets generated from two GIS-PET libraries and two ChIP-PET libraries.
The architecture of PET-Tool
PET-Tool is implemented for both UNIX and LINUX. The web-based user interface is implemented in Perl/CGI and hosted by Apache web server. The interface of the Tool can be accessed by any web-browser that supports the current web standards.
Data storage is handled by a combination of a flat file system and a MySQL-based Relational Database Management System (RDBMS). The MySQL database is used for efficient PET data storage, tracking, and retrieval, and for interfacing with back-end programs through the Perl DBI module. MySQL also hosts various statistical data and mapping results. Flat files are used to store uploaded sequence data, with the positional indices of all sequences stored in the MySQL database for quick sequence retrieval. Back-end programs were implemented in Perl and C. Compressed Suffix Array (CSA) programs were implemented in C for high efficiency and robust performance of advanced data structures. Programs for PET sequence extraction, statistical computation, data retrieval and storage, web interaction, and other non-intensive tasks were implemented in Perl. Minimum hardware requirements are a 500 MHz Pentium III processor, 256 MB of RAM, and a 20 GB hard disk drive. A regular 500 MHz machine takes about two days to process a library of one million PETs; a machine with a 2.4 GHz processor can do the same job in a few hours.
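The flat-file-plus-index arrangement described above can be illustrated with a minimal sketch. It is written in Python rather than the Perl used by PET-Tool, it uses the standard-library sqlite3 module as a stand-in for MySQL, and the table name seq_index, the column names, and both function names are hypothetical. The idea shown is only the general one: store the byte offset of each sequence record in the database so that a record can be read from the flat file with a single seek.

```python
import sqlite3


def build_index(fasta_path, conn):
    """Record the byte offset of every FASTA header line in a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seq_index "
        "(seq_id TEXT PRIMARY KEY, byte_offset INTEGER)"
    )
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()          # position of the line about to be read
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Assumes a well-formed header: '>' followed by an identifier.
                seq_id = line[1:].split()[0].decode("ascii")
                conn.execute(
                    "INSERT INTO seq_index VALUES (?, ?)", (seq_id, offset)
                )
    conn.commit()


def fetch_sequence(fasta_path, conn, seq_id):
    """Seek straight to a record using the stored offset and return its sequence."""
    row = conn.execute(
        "SELECT byte_offset FROM seq_index WHERE seq_id = ?", (seq_id,)
    ).fetchone()
    if row is None:
        return None
    chunks = []
    with open(fasta_path, "rb") as handle:
        handle.seek(row[0])
        handle.readline()                   # skip the header line
        for line in handle:
            if line.startswith(b">"):       # next record begins
                break
            chunks.append(line.strip().decode("ascii"))
    return "".join(chunks)
```

With this layout the database stays small (one row per sequence) while the bulky sequence data remain in ordinary files, which is one plausible reason for combining the two storage types.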
Results and discussion
The current settings of PET-Tool can handle GIS-PET for transcriptome analysis and ChIP-PET data for whole genome localization of transcription factor binding sites. We have successfully applied PET-Tool to more than 45 GIS-PET and ChIP-PET libraries. To demonstrate the data processing workflow, and the functionalities and performance of PET-Tool, we analyzed two GIS-PET libraries and two ChIP-PET libraries in this study.
Table 1. Datasets used for analysis in this study
Statistics of PET characteristics: raw sequence reads; rejected poor PETs; rejection rate (%) *; total high-quality PETs; total unique PETs; redundancy (%) **; 5' AT content (%); 3' AT content (%).
Breakdown of rejected PETs: length < 34 bp; length > 40 bp; no AA-tail at 3' end; polyA(9) in 3' tag; polyA(9) in 5' tag; polyT(9) in 5' tag; polyT(9) in 3' tag.
PET-Tool procedure to process the PET library sequences
PET sequences generated from GIS-PET and ChIP-PET libraries
The spacer-defined raw PETs were then subjected to serial filtering steps to exclude incorrect PETs caused by imperfect reactions during the molecular cloning process. The Type IIs restriction enzyme cleavage, DNA end polishing, and ligation reactions are known to have a certain level of slippage, and the combination of these reactions causes actual PET lengths to deviate from the predicted PET lengths by one to several nucleotides. Hence, we set an empirical range (34–40 bp) around the expected size (36 bp) for true ditags. Ditags shorter than 34 bp or longer than 40 bp were considered experimental artifacts and were removed from further analysis. PET sequences with low complexity (homopolymer stretches of more than 8 consecutive identical nucleotides, such as As or Ts) were also removed because these PETs lack sufficient specificity for mapping to reference genome sequences. As an indication of PET orientation, we kept an "AA" residue of the cDNA polyA tail at the 3' end of the PET sequences in GIS-PET libraries. Therefore, GIS-PET ditags that did not contain the AA tail at the 3' end were considered questionable and were also removed. After these layers of filtering, 864,964 high-quality PETs were collected for the two GIS-PET libraries and 1,489,412 high-quality PETs for the ChIP-PET libraries. Redundant PETs were collapsed into unique PETs. The copy number of each unique PET reflects the abundance level of the PET in a given library. In total, 135,757 unique PETs were collected for SCH012, 145,138 for SCH013, 640,844 for SCH016, and 582,253 for SCH019 (Table 1).
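The filtering and collapsing steps described above can be sketched as follows. This is a minimal illustration in Python, not PET-Tool's own Perl implementation. The function names, the order in which the filters are applied, and the choice to test for homopolymers over the whole PET rather than separately in the 5' and 3' tags are all assumptions; only the thresholds (34–40 bp, more than 8 identical consecutive bases, AA tail for GIS-PETs) come from the text.

```python
import re
from collections import Counter

MIN_LEN, MAX_LEN = 34, 40                 # empirical window around the expected 36 bp
HOMOPOLYMER = re.compile(r"(.)\1{8,}")    # 9 or more identical consecutive bases


def reject_reason(pet, require_aa_tail):
    """Return why a raw PET is rejected, or None if it passes every filter."""
    if len(pet) < MIN_LEN:
        return "too short"
    if len(pet) > MAX_LEN:
        return "too long"
    if HOMOPOLYMER.search(pet):
        return "low complexity"
    if require_aa_tail and not pet.endswith("AA"):
        return "no AA tail"
    return None


def filter_and_collapse(raw_pets, require_aa_tail=True):
    """Filter raw PETs and collapse duplicates into unique PETs with copy numbers.

    Set require_aa_tail=False for ChIP-PET libraries, which carry no AA tail.
    Returns (kept, rejected): kept maps each unique PET to its copy number,
    rejected maps each rejection reason to the number of PETs removed.
    """
    kept = Counter()
    rejected = Counter()
    for pet in raw_pets:
        pet = pet.upper()
        reason = reject_reason(pet, require_aa_tail)
        if reason is None:
            kept[pet] += 1
        else:
            rejected[reason] += 1
    return kept, rejected
```

The number of keys in kept corresponds to the count of unique PETs, and the sum of its values to the count of high-quality PETs.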
Evaluation of PET quality using Examiner
Comparison of PET sequences derived from GIS-PET and ChIP-PET libraries
Although the methods used to generate GIS-PET and ChIP-PET were similar, the starting DNA materials were rather different. GIS-PETs were derived from cDNA, while ChIP-PETs were derived from ChIP-enriched genomic DNA fragments. It appears that the quality of GIS-PETs is lower than that of ChIP-PETs: about 22% of GIS-PET sequences, as opposed to 8.2% of ChIP-PET sequences, were rejected after quality filtering. Several factors contribute to the higher error rate for GIS-PETs. One of the major differences between GIS-PET and ChIP-PET was the inclusion of an AA-tail as a 3' directional indicator at the end of the 3' signature of each GIS-PET sequence. We observed that 30% of the rejected GIS-PETs lacked the appropriate AA-tail. We also observed that the AT content in GIS-PETs was significantly polarized, at 31% for the 5' tag region and 61% for the 3' tag region. This observation is consistent with the knowledge that in transcripts or cDNAs, the 5' UTR (untranslated region) is GC rich and the 3' UTR is AT rich. In contrast, this 3'-biased polarization of AT content was not observed in ChIP-PET sequences because the ChIP DNA fragments were generated by random shearing of genomic DNA.
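The AT-content comparison between the 5' and 3' tag regions can be reproduced with a few lines of code. The sketch below is in Python and makes two assumptions that the text does not confirm: each PET is split at its midpoint into a 5' half and a 3' half, and the percentages are pooled over all bases rather than averaged per PET. Whether the retained AA residue was counted in the 3' figure is likewise not stated. The function names are hypothetical.

```python
def at_content(seqs):
    """Pooled percentage of A and T bases across a collection of sequences."""
    total = sum(len(s) for s in seqs)
    if total == 0:
        return 0.0
    at = sum(s.upper().count("A") + s.upper().count("T") for s in seqs)
    return 100.0 * at / total


def tag_at_content(pets):
    """Split each PET at its midpoint and return (5' AT %, 3' AT %)."""
    five_prime = [p[: len(p) // 2] for p in pets]
    three_prime = [p[len(p) // 2:] for p in pets]
    return at_content(five_prime), at_content(three_prime)
```

A strongly polarized library such as the GIS-PET libraries would give a clearly higher second value than first, whereas randomly sheared genomic fragments would be expected to give two similar values.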
Mapping of PET sequences to reference genome sequences
Table: PET mapping to the genome
Rows: un-mapping rate (%); mapping rate (%); PETs mapped to one location (PET1); PET1 rate (%) relative to unique PETs; PET1 rate (%) relative to mapped PETs; PET1 tag total counts; PET1 redundancy (%) *; PET1 mapped to known genes; PET1 rate (%) mapped to known genes; PETs mapped to multiple locations (% of all mapped PETs).
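The mapping statistics named in the row labels above can be derived from a per-PET count of genome locations. The following Python sketch shows one way to do this. The input layout, the function name, and the dictionary keys are hypothetical, and the denominators are inferred from the row labels (for example, the mapping and un-mapping rates are taken relative to unique PETs), so they are assumptions rather than the definitions used by PET-Tool.

```python
def mapping_summary(pets):
    """Summarize mapping results.

    pets maps each unique PET to a tuple
    (copy_number, number_of_genome_locations).
    """
    unique = len(pets)
    mapped = [v for v in pets.values() if v[1] >= 1]
    pet1 = [v for v in mapped if v[1] == 1]      # mapped to exactly one location
    multi = len(mapped) - len(pet1)              # mapped to several locations

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "unique": unique,
        "unmapped_rate": pct(unique - len(mapped), unique),
        "mapped_rate": pct(len(mapped), unique),
        "pet1": len(pet1),
        "pet1_rate_of_unique": pct(len(pet1), unique),
        "pet1_rate_of_mapped": pct(len(pet1), len(mapped)),
        "pet1_tag_total": sum(copies for copies, _ in pet1),
        "multi_rate_of_mapped": pct(multi, len(mapped)),
    }
```

Keeping the copy number alongside each unique PET allows both unique-PET rates and total tag counts to be reported from the same pass over the data.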
We have developed a comprehensive software package, PET-Tool, to meet the demand for automated processing of the large volumes of PET sequences generated by PET-based experiments. We demonstrated the utility of PET-Tool by analyzing four PET libraries and more than 2.7 million PET sequences, and showed that PET-Tool can accurately and efficiently dissect PET concatemer sequences, extract and organize PET sequences in a relational database for convenient evaluation of sequence quality and overall experimental integrity, and specifically map the PET sequences to the corresponding reference genome sequences.
Availability and requirements
Project name: PET-Tool; Project home page: http://www.gis.a-star.edu.sg/PET_Tool; Operating system(s): UNIX and LINUX; Programming languages: Perl and C.
PET-Tool is free for non-commercial use. The complete package of PET-Tool is available in DVD format to be sent upon request, and downloadable from the PET-Tool home page. For users who would like to understand more of the PET methodology, a detailed experimental protocol and a user manual are also available at the PET-Tool website.
SAGE: Serial Analysis of Gene Expression
GIS: Gene Identification Signature
CSA: Compressed Suffix Array
TSS: Transcription Start Site
The authors want to thank Mr. Charlie Lee for participation in webpage design, and Drs. Patrick Ng and Guillaume Bourque for invaluable suggestions. This work is supported by the Agency for Science, Technology and Research (A*STAR) of Singapore and the NIH/NHGRI (1R01HG003521-01).
- Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression. Science 1995, 270: 484–487.
- Saha S, Sparks AB, Rago C, Akmaev V, Wang CJ, Vogelstein B, Kinzler KW: Using the transcriptome to annotate the genome. Nature Biotechnol 2002, 20: 508–512. 10.1038/nbt0502-508
- Wang TL, Maierhofer C, Speicher MR, Lengauer C, Vogelstein B, Kinzler KW, Velculescu VE: Digital karyotyping. PNAS USA 2002, 99: 16156–16161. 10.1073/pnas.202610899
- Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H, Kodzius R, Watahiki A, Nakamura M, Arakawa T, Fukuda S, Sasaki D, Podhajska A, Harbers M, Kawai J, Carninci P, Hayashizaki Y: Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. PNAS USA 2003, 100: 15776–15781. 10.1073/pnas.2136655100
- Hashimoto SI, Suzuki Y, Kasai Y, Morohoshi K, Yamada T, Sese J, Morishita S, Sugano S, Matsushima K: 5' end SAGE for the analysis of transcriptional start sites. Nature Biotechnology 2004, 22: 1146–1149. 10.1038/nbt998
- Wei CL, Ng P, Chiu KP, Wong CH, Ang CC, Lipovich L, Liu ET, Ruan Y: 5' long serial analysis of gene expression (LongSAGE) and 3' LongSAGE for transcriptome characterization and genome annotation. PNAS USA 2004, 101: 11701–11706. 10.1073/pnas.0403514101
- Ng P, Wei CL, Sung WK, Chiu KP, Lipovich L, Ang CC, Gupta S, Shahab A, Ridwan A, Wong CH, Liu E, Ruan Y: Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nature Methods 2005, 2: 105–111. 10.1038/nmeth733
- The FANTOM Consortium: The transcriptional landscape of the mammalian genome. Science 2005, 309: 1559–1563. 10.1126/science.1112014
- Wei CL, Wu Q, Vega V, Chiu KP, Ng P, Zhang T, Shahab A, Ridwan A, Fu YT, Weng Z, Liu JJ, Kuznetsov VA, Sung K, Lim B, Liu E, Chan QY, Ng HH, Ruan Y: A global mapping of p53 transcription factor binding sites in the human genome. Cell 2006, 124: 207–219. 10.1016/j.cell.2005.10.043
- Loh YH, Wu Q, Chew JL, Vega VB, Zhang W, Chen X, Bourque G, George J, Leong B, Liu J, Wong KY, Sung KW, Lee CWH, Zhao X-D, Chiu K-P, Lipovich L, Kuznetsov VA, Robson P, Stanton LW, Wei CL, Ruan Y, Lim B, Ng HH: The Oct4 and Nanog transcription network that regulates pluripotency in mouse embryonic stem cells. Nature Genetics 2006, 38: 431–440. 10.1038/ng1760
- The ENCODE Project Consortium: The ENCODE (ENCyclopedia of DNA Elements) Project. Science 2004, 306: 636–640. [http://www.genome.gov/Pages/Research/ENCODE/] 10.1126/science.1105136
- Zhang L, Zhou W, Velculescu VE, Kern SE, Hruban RH, Hamilton SR, Vogelstein B, Kinzler KW: Gene expression profiles in normal and cancer cells. Science 1997, 276: 1268–1272. 10.1126/science.276.5316.1268
- van Kampen AHC, van Schaik BDC, Pauws E, Michiels EMC, Ruijter JM, Caron HN, Versteeg R, Heisterkamp SH, Leunissen JAM, Baas F, van der Mee M: USAGE: a web-based approach towards the analysis of SAGE data. Bioinformatics 2000, 16: 899–905. 10.1093/bioinformatics/16.10.899
- Lash AE, Tolstoshev CM, Wagner L, Schuler GD, Strausberg RL, Riggins GJ, Altschul SF: SAGEmap: A public Gene Expression Resource. Genome Research 2000, 10: 1051–1060. 10.1101/gr.10.7.1051
- Bala P, Georgantas RW 3, Sudhir D, Suresh M, Shanker K, Vrushabendra BM, Civin CI, Pandey A: TAGmapper: a web-based tool for mapping SAGE tags. Gene 2005, 364: 123–129. 10.1016/j.gene.2005.05.044
- Louie E, Ott J, Majewski J: Nucleotide frequency variation across human genes. Genome Research 2003, 13: 2594–2601. 10.1101/gr.1317703
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.