DNPTrapper: an assembly editing tool for finishing and analysis of complex repeat regions
© Arner et al; licensee BioMed Central Ltd. 2006
Received: 10 November 2005
Accepted: 20 March 2006
Published: 20 March 2006
Many genome projects are left unfinished due to complex, repeated regions. Finishing is the most time consuming step in sequencing and current finishing tools are not designed with particular attention to the repeat problem.
We have developed DNPTrapper, a shotgun sequence finishing tool, specifically designed to address the problems posed by the presence of repeated regions in the target sequence. The program detects and visualizes single base differences between nearly identical repeat copies, and offers the overview and flexibility needed to rapidly resolve complex regions within a working session. The use of a database allows large amounts of data to be stored and handled, and allows viewing of mammalian size genomes. The program is available under an Open Source license.
With DNPTrapper, it is possible to separate repeated regions that previously were considered impossible to resolve, and finishing tasks that previously took days or weeks can be resolved within hours or even minutes.
High-throughput methods for genome sequencing, in combination with increased computer power and better algorithms for sequence assembly, have yielded a plethora of genomes accessible for analysis. However, complicated parts of sequenced genomes tend to be left unfinished to a large extent. This is especially the case for eukaryotic genomes, where the majority of the genomes presently sequenced have repeated regions that were left unresolved (see e.g. [1] and [2] for discussion on how duplications affect eukaryotic genome projects). Current shotgun sequencing assembly programs are not designed to handle long stretches of repeated DNA in the target sequence, and it is common that repeated sequences are left out of the assembly altogether. In addition, repeats often cause assembly errors, e.g. large artificial rearrangements due to misassembled repeat regions. Also common are assemblies with the repeat copies merged into alignments of high coverage, with reads of the repeat region piled on top of each other. Although many repeats appear to have no discernible biological function, in many cases the repeats play an important role in the biology of the organism [3], and some organisms have a significant amount of their genes organized into head-to-tail tandem arrays consisting of nearly identical genes. One example is Trypanosoma cruzi, a protozoan parasite with a highly repetitive genome [4] containing multi-copy gene families such as cruzipain [5], histone H1 [6] and HSP70 [7].
The presence of repeated regions in the target sequence is thus the key problem in shotgun sequencing. This is especially true for the whole genome shotgun (WGS) approach that has emerged as the method of choice in recent years. Where the previous clone-by-clone strategies allowed for compartmentalizing the genome and handling of repeat regions locally, the WGS approach requires handling of all copies of a repeat region simultaneously, even if they are spread throughout the genome. The problems caused by repeats can be somewhat reduced by combining the two approaches, but the incidence of repeats remains a key problem and major cause of errors in shotgun sequencing assemblies.
A successful strategy for solving the problem for short repeat regions has been the use of mate pairs [8]. Using mate pairs it is possible to correctly assemble tandem repeat regions or single repeat units dispersed in unique genomic sequence, depending on the order in which the fragments are assembled and provided that a sufficient number of the sequence reads sampling the repeat copies have mate pairs in the unique regions. However, this strategy fails when nearly identical repeats are organized in tandem stretches longer than twice the shotgun fragment insert length. In this case, the mate pairs of reads sampling repeat units sample another part of the same repeat region, which makes it impossible for current assembly algorithms to determine the correct layout of the shotgun fragment reads.
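The length condition above can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not the logic of any assembler: reads are reduced to single positions, the mate is placed exactly one insert length away, and a read counts as anchored if a mate on either side would land outside the repeat. The function names are hypothetical.

```python
def has_unique_anchor(read_pos, insert_len, repeat_start, repeat_end):
    """Return True if a mate one insert length away, on either side,
    would fall outside the half-open repeat interval [repeat_start, repeat_end)."""
    left_mate = read_pos - insert_len
    right_mate = read_pos + insert_len
    return left_mate < repeat_start or right_mate >= repeat_end


def unanchored_positions(insert_len, repeat_start, repeat_end):
    """List read positions inside the repeat whose mate cannot reach unique sequence."""
    return [
        pos
        for pos in range(repeat_start, repeat_end)
        if not has_unique_anchor(pos, insert_len, repeat_start, repeat_end)
    ]


# A repeat of exactly twice the insert length leaves no unanchored position;
# one base longer and a position in the middle loses its anchor.
short_region = unanchored_positions(insert_len=3, repeat_start=0, repeat_end=6)
long_region = unanchored_positions(insert_len=3, repeat_start=0, repeat_end=7)
```

Under these assumptions, unanchored positions appear only once the repeat region exceeds twice the insert length, which is the regime where mate pairs alone stop being informative.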
These problems of the common assembly methods place a heavy burden on the biologists working on the finishing stage of sequencing projects and add to the bottleneck that finishing constitutes. A number of tools have been developed to aid this process [9–12]. However, these tools, although very useful for non-repeated sequences, are not designed for finishing complex, repeated regions. Generally, a major problem with current finishing tools is that they provide either a close-up view of the shotgun reads in the different contigs of the assembly, or a zoomed-out view of the entire genome, with nothing in between. With a rigid close-up view, the user can only view a small portion of the repeat region at a time, and much scrolling is required in order to get a clear understanding of the region, whereas a genome-wide view does not allow for manual inspection and correction at the read level. Furthermore, common tools generally lack the flexibility needed to correct obvious errors in a straightforward fashion, often requiring the user to re-run the whole assembly and hope for the reads to end up in the correct positions. Most importantly, although other systems use high quality mismatches between reads in an attempt to separate repeats, none of them has adequate specificity or fully utilizes the presence of single base differences between repeat copies as a resource in repeat resolution.
We here present DNPTrapper, an assembly editing and visualization tool specifically designed for manual analysis and finishing of repeated regions. It differs from previous tools by providing flexibility and an overview that greatly simplifies the finishing process, by allowing the user to view whole repeat regions at once and to edit assembly errors manually by drag and drop. The program implements and visualizes the results of a previously described statistical method that detects defined nucleotide positions (DNPs, representing single base differences between repeat units) in the presence of sequencing errors [13].
The source code is available from the authors under an Open Source license.
DNPTrapper is a sequence alignment editing tool developed for finishing complicated shotgun sequencing projects. It combines relatively simple but powerful algorithms with visualization of problematic assembly locations, thus providing detailed information for biologists and allowing rapid decision making to make necessary corrections. The main goal of DNPTrapper is to provide more power to the user than other finishing tools. The user can move sequences using drag and drop; re-align them; cut, copy, and paste them; run algorithms on them; add and remove features; choose between view modes; zoom in and out. The user interface is a front end to a database, and changes made using Trapper are automatically propagated to the database. DNPTrapper is essentially an editor that visualizes assemblies of shotgun sequence fragment reads as gapped multiple alignments. The assemblies can be produced by any assembler that produces the supported file formats (e.g. .ace-files from Phrap), and can be exported to the same format after repeat analysis and resolution. In addition to the read sequences, different features and data such as DNPs, vector sequence, quality values, chromatograms and mate pairs are visualized in the editor according to the preferences of the user. Sequence features can be present in the input assembly files, or be added during the finishing process by running default built-in algorithms that detect and label the desired features.
Together with default built-in algorithms for sorting reads according to a variety of criteria (e.g. DNP content), these features allow for semi-automatic resolution of almost identical repeats in a straightforward fashion, reducing finishing time for such regions from days to a matter of hours or even minutes. Below, the flow of a typical session with DNPTrapper is outlined, with a description of some key aspects of program use.
The first step in a DNPTrapper session is to import data into the editor. File formats currently supported are the Phrap .ace format and a native XML format described in the documentation. Other formats will be added; in the meantime there are free converters between different file formats available at the AMOS website [14]. After importing, contigs in the project are available for analysis. In the default visualization mode after opening a contig, sequence reads are represented by black-bordered boxes on a white background. When zooming in, the base sequences and quality values, visualized by a grey-scale, become visible. The user can customize the visualization mode of quality values and other features.
Since most assembly programs produce assemblies that are locally non-optimal, the next step is to apply the built-in ReAlign algorithm [15] to the contigs that appear to contain repeats. This step is crucial for the subsequent application of the DNP algorithm.
When the selected contig has been ReAligned, the DNP algorithm can be run with the desired parameters (see [13] for details) in order to detect single base differences between repeat copies. These appear as color coded dots in the alignment. The different colors represent different DNP types, where a type is defined by the consensus base and the base of the single base difference. There are thus twelve different DNP types and twelve corresponding colors.
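The count of twelve follows from pairing each of the four consensus bases with each of the three remaining bases. A minimal sketch of that enumeration is given below; the function name is hypothetical, and the sketch assumes that only the four nucleotides (not gaps or ambiguity codes) define a type.

```python
from itertools import permutations

BASES = "ACGT"


def dnp_types(bases=BASES):
    """Return every ordered (consensus_base, variant_base) pair of distinct bases."""
    return list(permutations(bases, 2))


all_types = dnp_types()
# Give each type a stable index that a viewer could map to a color.
type_index = {dnp_type: i for i, dnp_type in enumerate(all_types)}
```

Because the pairs are ordered, a consensus A with variant C is a different type from a consensus C with variant A.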
Another strategy is to start by applying a sorting algorithm to the contig before finishing. It sorts the reads into different groups according to their DNP content by picking out a read and locating all other reads sharing DNPs with the original read and adding them to the set. Since the newly found reads may contain additional DNPs, the process is repeated until all the DNPs in the set have been exhausted. However, the DNP detection method may produce false positives, and it is often necessary to perform manual correction of the group assignments after sorting since groups may have been merged together due to erroneous DNP assignments. Still, this simple algorithm performs remarkably well due to the low error rate of the DNP detection method, and errors are easy to resolve using human judgment and other available data such as mate pairs and chromatograms. The most straightforward way to use DNPTrapper is to use the sorting procedure and resolve the remaining ambiguities manually.
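The sorting procedure described above amounts to growing a group from a seed read until no new shared DNPs are found. The following is a minimal sketch of that idea, not the DNPTrapper implementation: it assumes each read is summarized as a set of DNP labels, here hypothetical (alignment column, base) tuples, and that two reads share a DNP when they carry the same label.

```python
from collections import defaultdict


def group_reads_by_dnps(read_dnps):
    """Group reads that are linked, directly or transitively, by shared DNPs.

    read_dnps maps a read name to a set of DNP labels.
    Returns a list of sets of read names.
    """
    # Index: DNP label -> reads carrying it.
    reads_with = defaultdict(set)
    for read, dnps in read_dnps.items():
        for dnp in dnps:
            reads_with[dnp].add(read)

    unassigned = set(read_dnps)
    groups = []
    for seed in read_dnps:
        if seed not in unassigned:
            continue
        unassigned.discard(seed)
        group = {seed}
        pending = list(read_dnps[seed])  # DNPs still to be followed
        seen = set(pending)
        while pending:
            dnp = pending.pop()
            for other in reads_with[dnp]:
                if other in unassigned:
                    unassigned.discard(other)
                    group.add(other)
                    # Newly added reads may contribute additional DNPs.
                    for new_dnp in read_dnps[other]:
                        if new_dnp not in seen:
                            seen.add(new_dnp)
                            pending.append(new_dnp)
        groups.append(group)
    return groups


example = {
    "r1": {(10, "A")},
    "r2": {(10, "A"), (50, "G")},
    "r3": {(50, "G")},
    "r4": {(10, "C")},
    "r5": set(),
}
groups = group_reads_by_dnps(example)
```

The sketch also shows why false positive DNPs matter: a single erroneous label shared between two reads is enough to merge their groups, which is the situation that calls for manual correction.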
After analysis and resolution, the database can be exported as a flat file (currently available formats are .ace and the native XML format), so that the data can re-enter the normal finishing pipeline. There are also other export options that are more suitable if the objective is analysis rather than finishing; arbitrary subsets of sequences, or their consensus sequence, can be extracted and written to file in FASTA format for analysis with other tools.
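As an illustration of the FASTA export option, the sketch below writes a subset of reads to a FASTA stream. It is an assumption about what such an export might look like rather than DNPTrapper's own code; the function name and line width are hypothetical. Pad characters are removed on the assumption that gapped reads use "*" for pads, as in Phrap .ace files.

```python
import io


def write_fasta(records, handle, width=60):
    """Write (name, sequence) pairs to handle in FASTA format.

    Alignment pads ('*') are stripped so that the output is ungapped.
    """
    for name, sequence in records:
        ungapped = sequence.replace("*", "")
        handle.write(f">{name}\n")
        for start in range(0, len(ungapped), width):
            handle.write(ungapped[start:start + width] + "\n")


buffer = io.StringIO()
write_fasta([("read1", "ACGT*ACGT")], buffer, width=4)
fasta_text = buffer.getvalue()
```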
Great care has been taken to make the system suitable for projects of varying sizes. All data is stored in the database and is read from disk on request from the GUI. This keeps the RAM requirement low, allowing handling of projects of virtually any size. This includes, but is not limited to, mammalian-size genomes. The parsers used for importing data into DNPTrapper are event-based and file size is thus not a limiting factor.
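To illustrate what event-based parsing means here, the sketch below streams over the lines of a Phrap .ace file and emits one event per header line, without ever holding the whole file in memory. It is a simplified illustration and not the DNPTrapper parser: only the CO (contig), AF (read placement) and RD (read) header lines are recognized, and the event tuples are a hypothetical design.

```python
def ace_events(lines):
    """Yield simple events from an iterable of .ace file lines.

    Events are tuples:
      ("contig", name, padded_bases, read_count)
      ("placement", read_name, complemented, padded_start)
      ("read", read_name, padded_bases)
    All other lines (sequence, quality values, tags) are skipped.
    """
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        tag = fields[0]
        if tag == "CO" and len(fields) >= 6:
            yield ("contig", fields[1], int(fields[2]), int(fields[3]))
        elif tag == "AF" and len(fields) >= 4:
            yield ("placement", fields[1], fields[2] == "C", int(fields[3]))
        elif tag == "RD" and len(fields) >= 5:
            yield ("read", fields[1], int(fields[2]))


sample_lines = [
    "AS 1 2",
    "",
    "CO Contig1 12 2 1 U",
    "ACGT*ACGTACG",
    "",
    "AF read1 U 1",
    "AF read2 C 5",
    "RD read1 8 0 0",
    "ACGT*ACG",
]
events = list(ace_events(sample_lines))
```

In practice the generator would be fed an open file object, so that each event can be written to the database as it is produced.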
Apart from scalability, flexibility has also been a major design goal. A plug-in system and a well-documented Application Programming Interface make adding new algorithms, visualization modes and sequence features simple.
DNPTrapper currently runs under Linux Fedora Core 3 and has been tested on 32- and 64-bit platforms. Ports to other platforms may be carried out in the future.
In order to further illustrate the functionality of DNPTrapper, we have tested it on three data sets, one simulated and two from the T. cruzi whole genome shotgun sequencing project [4]. In all three cases, the reads from each project were assembled with Phrap and subsequently imported into DNPTrapper for further analysis. Phred quality values were not used in the assembly steps, in order to make sure that all reads with sequence similarity ended up aligned to each other. The use of quality values causes Phrap to partially resolve repeated regions, which makes for an incomplete analysis. Instead, the parameter -default_qual was set to 10. Regions of 89% average Phred quality or more in the reads were subjected to DNP analysis.
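For readers less familiar with Phred scores, the standard definition relates a quality value Q to an estimated base-call error probability P by Q = -10 log10(P). The sketch below only illustrates this conversion; the helper names are hypothetical, and it makes no claim about how Phrap or DNPTrapper apply quality thresholds internally.

```python
import math


def phred_to_error_probability(quality):
    """Convert a Phred quality value to the estimated base-call error probability."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Convert an error probability (0 < p <= 1) to a Phred quality value."""
    return -10 * math.log10(probability)


# A uniform quality of 10 corresponds to an estimated error rate of 1 in 10.
default_error = phred_to_error_probability(10)
```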
The simulated data set was included as a proof of concept, to verify the correctness of the implementation of the DNP method and the functionality of the program. The regions from T. cruzi were chosen since they are examples of complicated genomic regions of biological importance (both contain genes), and regions where the assembly program (Celera assembler, [18]) has failed to assemble the reads correctly. Furthermore, they constitute two different types of tandem repeats that can be resolved using DNPTrapper. These genes have not been characterized previously in T. cruzi. The regions were located by scanning the assembly for regions with unusually high shotgun coverage, and reads matching these regions were extracted by performing sequence similarity searches in the read database.
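Scanning an assembly for unusually high coverage can be sketched as follows. This is an illustrative assumption about one way to do it, not the procedure used in the study: reads are given as half-open (start, end) intervals on the consensus, and the depth threshold is a hypothetical parameter that would have to be chosen relative to the expected coverage of the project.

```python
def coverage_profile(read_intervals, length):
    """Return per-position read depth for half-open (start, end) intervals."""
    change = [0] * (length + 1)
    for start, end in read_intervals:
        change[start] += 1
        change[end] -= 1
    depth = []
    current = 0
    for delta in change[:length]:
        current += delta
        depth.append(current)
    return depth


def high_coverage_regions(depth, threshold):
    """Return half-open (start, end) regions where depth >= threshold."""
    regions = []
    region_start = None
    for position, value in enumerate(depth):
        if value >= threshold and region_start is None:
            region_start = position
        elif value < threshold and region_start is not None:
            regions.append((region_start, position))
            region_start = None
    if region_start is not None:
        regions.append((region_start, len(depth)))
    return regions


reads = [(0, 10), (5, 15), (5, 15), (6, 14), (12, 20)]
depth = coverage_profile(reads, 20)
piles = high_coverage_regions(depth, threshold=3)
```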
We simulated shotgun sequencing of a 20 kb template sequence consisting of 10 repeat units of 2 kb in tandem, with a pair-wise sequence difference between any two repeat copies of 2%. The simulation was performed using sim_gun (described in [19]), an in-house shotgun sequencing simulation program that emulates the sequencing process as closely as possible. It uses quality values from real shotgun projects when constructing reads at random places in the target sequence.
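A template of this shape can be approximated in a few lines. The sketch below is not sim_gun and makes several simplifying assumptions: every repeat unit is derived from one random ancestor by substituting half of the target divergence, so that any two units differ by approximately (not exactly) 2%, and reads are error-free substrings of fixed length without quality values. All names and default values are hypothetical.

```python
import random


def mutate(sequence, n_substitutions, rng):
    """Return a copy of sequence with substitutions at n distinct random positions."""
    bases = list(sequence)
    for position in rng.sample(range(len(bases)), n_substitutions):
        bases[position] = rng.choice([b for b in "ACGT" if b != bases[position]])
    return "".join(bases)


def tandem_template(unit_length=2000, copies=10, divergence=0.02, seed=0):
    """Build a tandem array whose units differ pairwise by roughly `divergence`."""
    rng = random.Random(seed)
    ancestor = "".join(rng.choice("ACGT") for _ in range(unit_length))
    # Two units that each differ from the ancestor by d/2 differ by about d.
    n_substitutions = round(unit_length * divergence / 2)
    units = [mutate(ancestor, n_substitutions, rng) for _ in range(copies)]
    return "".join(units), units


def sample_reads(template, n_reads, read_length, seed=1):
    """Sample error-free reads at uniformly random start positions."""
    rng = random.Random(seed)
    last_start = len(template) - read_length
    starts = [rng.randrange(last_start + 1) for _ in range(n_reads)]
    return [(start, template[start:start + read_length]) for start in starts]


template, units = tandem_template()
reads = sample_reads(template, n_reads=400, read_length=500)
```

With these defaults the template is 20 kb long and sampled at roughly tenfold coverage; a realistic simulation would additionally introduce sequencing errors guided by quality values, as the text describes for sim_gun.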
Elongation factor 2 (EF2)
The consensus sequence of the T. cruzi whole genome shotgun assembly contains a region with strong sequence homology to EF2, crucial for the translocation step in eukaryote protein synthesis [20]. The assembly program had assembled 202 shotgun reads covering this region into one pile, unable to separate the repeat copies. The sequence similarity search located an additional 140 reads from this gene that had not been included in the original assembly.
Monoglyceride lipase (MGL)
Another region present in the T. cruzi assembly is homologous to MGL, which is part of the fat digestion pathway [21]. Again, the reads sampling this region had been assembled in one pile consisting of 71 reads. Another 439 reads matching this region, discarded by the assembly program, were also located in the read database.
This repeat region was found to have a different structure than the EF2 region described above. Instead of two large groups corresponding to conserved repeat arrays on each homolog, the reads could be divided into several small groups representing repeat units present in one or two copies each.
Repeated regions remain the key problem in shotgun sequencing. Current assembly algorithms are unable to assemble nearly identical repeats correctly. Repeated parts of the target genome are routinely left out of the assembly altogether, and when they are included, the resulting assemblies are full of errors such as large rearrangements, merged repeat copies and broken scaffolds.
Finishing is the major bottleneck in sequencing projects, and complex, repeated regions are often left unresolved. Despite the fact that repeats are the main reason for assembly errors, commonly used finishing tools, e.g. Consed, are not designed with the repeat problem in mind. The software is very useful for gap closure and other finishing operations on non-repeated sequences. However, when repeats are encountered, current tools lack flexibility and overview: they do not allow the user to correct obvious errors manually, and they present a rigid, overly close-up view of the repeat region. Moreover, no current finishing software utilizes single base differences between repeat copies as a tool for repeat separation.
With this in mind, we have developed a shotgun sequencing assembly editor specifically designed to cope with the problems encountered when the target sequence contains nearly identical repeats. By zooming out of a contig, the user gets an overview of its length and depth, and the color coding of DNPs makes it possible to directly see patterns of repeat groups in the data, which allows for rapid manual repeat separation. Instead of only allowing assembly programs to determine the positions of reads, the user is allowed to drag and drop sequences into the positions they see fit. The option of re-assembly is still available, and the number of options for finishing is therefore increased. These three features – DNP method, overview, and flexibility – set this tool apart from other finishing software and introduce a new concept for finishing.
The structure of DNPTrapper makes it possible to add new features. Several possible extensions can already be envisaged. Most important is the addition of an algorithm that uses mate pairs to order the resolved repeat groups. Other improvements include supporting more input file formats, adding more viewable features, and adding a global view of all contigs in a project, with specific algorithms for comparing, ordering and merging contigs. Also, work is in progress to interface DNPTrapper to TIGR's open source assembler AMOS [14].
Our results show that finishing tasks previously deemed impossible to resolve or very time consuming can be performed in a straightforward fashion using DNPTrapper. Using this tool, repeat regions that would take days or weeks to resolve with current software can instead be resolved within hours or even minutes. The use of DNPTrapper as a finishing tool reduces finishing times and thus costs. It also allows in-depth studies and characterization of the poorly understood repeat regions present in most genomes.
Availability and requirements
Project name: DNPTrapper
Project home page: http://dnptrapper.sourceforge.net/
Operating system: Linux
Programming language: C++
License: BSD Open Source license
Any restrictions to use by non-academics: No restrictions
The authors wish to thank Erik Sjölund for important work on the GUI and database systems, Staffan Alveteg for implementing ReAligner, Daniel Nilsson and reviewers for insightful comments, and Fatima Farzana for extensive testing. This work was supported by grants from the Swedish Research Council and NIH (U01 AI045061).
1. She X, Jiang Z, Clark RA, Liu G, Cheng Z, Tuzun E, Church DM, Sutton G, Halpern AL, Eichler EE: Shotgun sequence assembly and recent segmental duplications within the human genome. Nature 2004, 431(7011):927–30. 10.1038/nature03062
2. Eichler EE, Clark RA, She X: An assessment of the sequence gaps: unfinished business in a finished human genome. Nat Rev Genet 2004, 5: 345–54. 10.1038/nrg1322
3. Ji Y, Eichler EE, Schwartz S, Nicholls RD: Structure of chromosomal duplicons and their role in mediating human genomic disorders. Genome Res 2000, 10(5):597–610. 10.1101/gr.10.5.597
4. El-Sayed NM, Myler PJ, Bartholomeu DC, Nilsson D, Aggarwal DG, Tran AN, Ghedin E, Worthey EA, Delcher AL, Blandin G, Westenberger SJ, Caler E, Cerqueira GC, Branche C, Haas B, Anupama A, Arner E, Åslund L, Attipoe P, Bontempi E, Bringaud F, Burton P, Cadag E, Campbell DA, Carrington M, Crabtree J, Darban H, da Silveira JF, de Jong P, Edwards K, Englund PT, Fazelina G, Feldblyum T, Ferella M, Frasch AC, Gull K, Horn D, Hou L, Huang Y, Kindlund E, Klingbeil M, Kluge S, Koo H, Lacerda D, Levin MJ, Lorenzi H, Louie T, Machado CR, McCulloch R, McKenna A, Mizuno Y, Mottram JC, Nelson S, Ochaya S, Osoegawa K, Pai G, Parsons M, Pentony M, Pettersson U, Pop M, Ramirez JL, Rinta J, Robertson L, Salzberg SL, Sanchez DO, Seyler A, Sharma R, Shetty J, Simpson AJ, Sisk E, Tammi MT, Tarleton R, Teixeira S, Van Aken S, Vogt C, Ward PN, Wickstead B, Wortman J, White O, Fraser CM, Stuart KD, Andersson B: The genome sequence of Trypanosoma cruzi, etiologic agent of Chagas disease. Science 2005, 309(5733):409–415. 10.1126/science.1112631
5. Campetella O, Henriksson J, Aslund L, Frasch AC, Pettersson U, Cazzulo JJ: The major cysteine proteinase (cruzipain) from Trypanosoma cruzi is encoded by multiple polymorphic tandemly organized genes located on different chromosomes. Mol Biochem Parasitol 1992, 50: 225–34. 10.1016/0166-6851(92)90219-A
6. Aslund L, Carlsson L, Henriksson J, Rydaker M, Toro GC, Galanti N, Pettersson U: A gene family encoding heterogeneous histone H1 proteins in Trypanosoma cruzi. Mol Biochem Parasitol 1994, 65: 317–30. 10.1016/0166-6851(94)90082-5
7. Requena JM, Lopez MC, Jimenez-Ruiz A, de la Torre JC, Alonso C: A head-to-tail tandem organization of hsp70 genes in Trypanosoma cruzi. Nucleic Acids Res 1988, 16: 1393–406.
8. Edwards A, Caskey CT: Closure strategies for random DNA sequencing methods. A Companion to Methods in Enzymology 1990, 3: 41–47. 10.1016/S1046-2023(05)80162-8
9. Gordon D, Abajian C, Green P: Consed: A graphical tool for sequence finishing. Genome Res 1998, 8: 195–202.
10. Staden R, Beal KF, Bonfield JK: The Staden Package, 1998. Methods Mol Biol 2000, 132: 115–130.
11. Gordon D, Desmarais C, Green P: Automated finishing with Autofinish. Genome Res 2001, 11: 614–625. 10.1101/gr.171401
12. Frangeul L, Glaser P, Rusniok C, Buchrieser C, Duchaud E, Dehoux P, Kunst F: CAAT-Box, contigs-Assembly and Annotation Tool-Box for genome sequencing projects. Bioinformatics 2004, 20: 790–797. 10.1093/bioinformatics/btg490
13. Tammi MT, Arner E, Britton T, Andersson B: Separation of nearly identical repeats in shotgun assemblies using defined nucleotide positions, DNPs. Bioinformatics 2002, 18: 379–388. 10.1093/bioinformatics/18.3.379
14. AMOS home page [http://www.tigr.org/software/AMOS/]
15. Anson EL, Myers EW: Realigner: a program for refining DNA sequence multi-alignments. J Comp Biol 1997, 4: 369–83.
16. QT home page [http://www.trolltech.com]
17. Berkeley DB home page [http://www.sleepycat.com]
18. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KH, Remington KA, Anson EL, Bolanos RA, Chou HH, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC, Chen L, Dunn PJ, Lai Z, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng X, Rubin GM, Adams MD, Venter JC: A whole-genome assembly of Drosophila. Science 2000, 287(5461):2196–204. 10.1126/science.287.5461.2196
19. Tammi MT, Arner E, Andersson B: TRAP: Tandem Repeat Assembly Program produces improved shotgun assemblies of repetitive sequences. Comput Methods Programs Biomed 2003, 70(1):47–59. 10.1016/S0169-2607(01)00194-8
20. Arlinghaus R, Shaeffer J, Schweet R: Mechanism of peptide bond formation in polypeptide synthesis. Proc Natl Acad Sci USA 1964, 51: 1291–9.
21. Senior JR, Isselbacher KJ: Demonstration of an intestinal monoglyceride lipase: an enzyme with a possible role in the intracellular completion of fat digestion. J Clin Invest 1963, 42: 187–95.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.