Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies
© Wetzel et al; licensee BioMed Central Ltd. 2011
Received: 17 December 2010
Accepted: 13 April 2011
Published: 13 April 2011
Next-generation sequencing technologies allow genomes to be sequenced more quickly and less expensively than ever before. However, as sequencing technology has improved, the difficulty of de novo genome assembly has increased, due in large part to the shorter reads generated by the new technologies. The use of mated sequences (referred to as mate-pairs) is a standard means of disambiguating assemblies to obtain a more complete picture of the genome without resorting to manual finishing. Here, we examine the effectiveness of mate-pair information in resolving repeated sequences in the DNA (a paramount issue to overcome). While it has been empirically accepted that mate-pairs improve assemblies, and a variety of assemblers use mate-pairs in the context of repeat resolution, the effectiveness of mate-pairs in this context has not been systematically evaluated in previous literature.
We show that, in high-coverage prokaryotic assemblies, libraries of short mate-pairs (about 4-6 times the read-length) more effectively disambiguate repeat regions than the libraries that are commonly constructed in current genome projects. We also demonstrate that the best assemblies can be obtained by 'tuning' mate-pair libraries to accommodate the specific repeat structure of the genome being assembled - information that can be obtained through an initial assembly using unpaired reads. These results are shown across 360 simulations on 'ideal' prokaryotic data as well as assembly of 8 bacterial genomes using SOAPdenovo. The simulation results provide an upper-bound on the potential value of mate-pairs for resolving repeated sequences in real prokaryotic data sets. The assembly results show that our method of tuning mate-pairs exploits fundamental properties of these genomes, leading to better assemblies even when using an off -the-shelf assembler in the presence of base-call errors.
Our results demonstrate that dramatic improvements in prokaryotic genome assembly quality can be achieved by tuning mate-pair sizes to the actual repeat structure of a genome, suggesting the possible need to change the way sequencing projects are designed. We propose that a two-tiered approach - first generate an assembly of the genome with unpaired reads in order to evaluate the repeat structure of the genome; then generate the mate-pair libraries that provide most information towards the resolution of repeats in the genome being assembled - is not only possible, but likely also more cost-effective as it will significantly reduce downstream manual finishing costs. In future work we intend to address the question of whether this result can be extended to larger eukaryotic genomes, where repeat structure can be quite different.
Next-generation sequencing platforms such as Illumina, ABI Solid, and 454 allow genomes to be sequenced more quickly and at a lower cost than ever before. However, as sequencing technology has improved, the wealth of available data has not made genome assembly easier. In fact, the level of difficulty has increased, as the cost savings provided by new technologies are accompanied by a reduction in achievable read-length. Assembly software is now faced with the task of assembling reads ranging from approximately 35 to 500 nucleotides, in contrast to the 800-2000 nucleotide reads previously generated by the traditional Sanger technology. Additionally, the error characteristics of the newer technologies are not as well known as they were for traditional Sanger sequencing. De novo assemblies (even in the case of prokaryotic genomes) are highly fragmented . In addition, advances in sequencing technologies have not been mirrored by corresponding improvements in finishing - a time-, labor-, and cost-intensive process aimed at reconstructing a complete, gapless, genome sequence from a fragmented assembly. As a result, the majority of genomes remain in an incomplete 'draft' state, hampering studies that rely on long-range genome structure information (e.g., analysis of operon/regulon structure, or large-scale genomic variation).
Genomic repeats - DNA segments repeated in nearly-identical form throughout a genome - are the main reason why genome assemblers cannot automatically reconstruct complete genome sequences from modern sequencing data. Repeats have a number of biological causes, including transposable sequence elements (which can often be highly abundant in a genome), prophages, highly-conserved gene clusters (e.g., the Ribosomal RNA operons in bacteria), or large segmental duplications, and can vary from short simple sequence repeats (e.g., AAAAA, ACACAC) to long stretches of DNA (thousands to tens of thousands of base-pairs) that are highly or completely identical. In the context of inexpensive next-generation sequencing and assembly, some of the original pitfalls of genome assembly, such as coverage gaps and confusion caused by read errors, can often be obviated by simply sequencing the genome to very high coverage levels. Repeated sequences, on the other hand, cannot be avoided by simply over-sequencing, and lead to considerable ambiguity in the reconstruction of a genome, providing limits on the length of the contiguous DNA segments (contigs) that can be correctly reconstructed using unpaired reads .
Graph-theoretic models of genome assembly provide an effective framework for analyzing the impact of repeats on the complexity of genome assembly . The genome assembly problem is commonly formulated as finding a constrained path through an appropriately-defined graph. Throughout this article we will rely on an Eulerian formulation  that reduces the assembly problem to finding a Chinese Postman path/tour (a minimum length path through the graph that covers all edges) within a de Bruijn graph (to be defined later). Under this formulation, repeats appear as forks in the graph that make it difficult to select the graph traversal that corresponds to the correct reconstruction of the genome from among an exponential (in the number of repeats) number of possible Chinese Post-man paths. Without additional information, genome assemblers can only correctly reconstruct relatively short, unordered segments of the genome (corresponding to those sections of the graph which contain no forks).
In addition to the collection of reads, most sequencing technologies also produce pairwise constraints on the placement (approximate distance and relative orientation) of these reads along the genome - mate-pair information. This information can further constrain the possible traversals of the assembly graph, thereby allowing longer segments of the genome to be unambiguously reconstructed. Mate-pair information has been a critical component of most genome projects, starting with Haemophilus influenzae - the first free-living organism to be fully sequenced. Most genome assemblers now include modules that can use mate-pair information for scaffolding and repeat resolution (Celera Assembler , Velvet , Euler , Arachne , and ALLPATHS  to name just a few), and stand-alone scaffolding tools such as Bambus  allow the incorporation of mate-pair data into virtually any assembly . Furthermore, mate-pairs are frequently used for the purpose of assembly validation in tools such as BACCardi , Consed , and Hawkeye .
Most of the research on the use of mate-pair information in genome projects has focused primarily on the use of this information to achieve long-range connectivity by spanning long repeats and gaps in the assembly due to insufficient sequencing coverage. As a result, substantial efforts have been focused on the development of robust protocols for constructing long-range (8-10 kbp or longer) mate-pair libraries. Here, we explore a complementary purpose for mate-pairs - the automatic resolution of repeats during assembly. We argue that, due to affordable high-throughput sequencing technologies, coverage gaps are far less frequent than they used to be, and assembly fragmentation is primarily caused by repeats. Thus improvements in repeat resolution algorithms will translate into substantial improvements in the quality of the resulting assemblies.
Resolving repeats with mate-pairs
All previously published techniques for repeat resolution rely on the same basic observation: If a unique path in the assembly graph can be found that connects the sequences at the ends of a mate-pair and the length of this path matches the approximately known mate-pair size, then this path can be inferred to be correct; i.e., the path represents a partial traversal of the graph that is consistent with the correct reconstruction of the genome being assembled. Since the problem of finding a path of predefined length through a graph is NP-hard , the algorithms used during assembly rely on various heuristics for efficiently finding paths that support mate-pair information.
The EULER assembler  greedily finds paths whose lengths are consistent with the size of mate-pairs (the authors indicate that such paths can be easily found for a majority of mate-pairs), then converts these paths into artificial long reads that can be processed through the Eulerian superpath algorithm . Conflicts between paths through the graph are resolved by prioritizing paths on the basis of their support (number of mate-pairs that confirm a given path) . The Velvet [7, 17] and ALLPATHS  assemblers take into account the uniqueness of nodes in the assembly graph. Specifically, mate-pair links are considered only if they are anchored in contigs that correspond to non-repeated sequences in the genome (usually determined through depth of coverage statistics).
Analyses of the effectiveness of such algorithms have typically assumed the parameters of the sequencing experiment to be fixed; i.e., the goal is to build the best assembly possible given the types of data commonly generated in current sequencing projects. An exception is the study by Chaisson et al.  where the authors evaluate the effect of read length and read quality on the ability to reconstruct a genome. In other words, they assume one can tune the read length generated by a sequencing machine (which is a realistic assumption for the Illumina technology) and estimate whether the assembly improves, and by how much, if the length of reads is increased. Such analyses are critical for providing a scientific basis for picking the optimal trade-off between sequencing cost (which increases with the read length) and quality of assembly.
In our work, we pose a complementary question: How useful are mate-pairs for resolving repeats in de novo assemblies created from short-reads? Which types of mate-pair libraries most effectively resolve repeats and minimize the amount of manual finishing needed to complete the genome (given that there is some restriction on the number of mate-pair libraries that can be cost-effectively produced in a real sequencing experiment)? Similar questions have been addressed by recent publications in the context of sequence alignment. Chikhi et al.  evaluate how read length affects genome resequencing experiments, and Bashir et al.  developed a method to identify the mate-pair sizes that are optimal for detecting structural variation through mapping. However, the question of how effective mate-pairs are for the resolution of repeats and how this parameter of the sequencing experiment can be adjusted to improve assemblies has not previously been analyzed in a systematic fashion.
Measuring assembler performance
Metrics commonly used for comparing the quality of genome assemblies are primarily focused on statistics derived from the global distribution of contig sizes. Statistics such as the number of contigs, average contig size, and N50 contig size are frequently reported in the literature. (N50 contig size is the size c for which 50% of the bases in the genome are contained in contigs of size at least c.)
When the correct answer is known (e.g., reassembly of a known genome), one can also record the number of errors found in the assembly. An alternative approach is proposed by Chaisson et al.  where they compare the results of an assembly to a theoretical optimal: the best assembly that can be reconstructed without errors from a genome, given the repeat graph (see Methods) of that genome.
In our study we are specifically targeting this theoretical optimum: in an idealized setting (perfect sequencing data), what is the best possible assembly that can be obtained given the parameters of the sequencing process? While aiming for this theoretical optimum, we also maintain a dose of reality about certain aspects of sequencing projects, restricting ourselves to the use of only two mate-pair libraries (a fairly standard procedure due to costs) and use of read-lengths that are reflective of inexpensive next-generation sequencing. Like Chaisson et al.  and Kingsford et al. , we start with the idealized repeat graph of a genome (see Methods), the structure of which is determined by the read length. Although we are attempting to maximize the size of contigs that can be unambiguously reconstructed from this graph, contig size statistics are difficult to compare across genomes and may not adequately describe the amount of repeat resolution that a set of mate-pairs provide. Therefore, we focus instead on a measure of assembly ambiguity that is directly related to unresolved repeats. Specifically, we measure the number of manual experiments that would be necessary to completely resolve the structure of a genome during finishing. Briefly, a path through the repeat graph can be uniquely determined by pairing up the edges adjacent to repeat nodes. This pairing is commonly determined during finishing through targeted PCR experiments.
We measure the usefulness of a mate-pair library in terms of the amount of manual finishing effort that can be saved through its use (see Methods for details). We show that, in the context of high-depth prokaryotic sequencing experiments, very short mate-pairs are more useful than long mate-pairs (sizes as high as 40,000 bp are often generated in sequencing projects) for resolving repeats, and that choosing mate-pair sizes based on the repeat structure (defined in Methods) of the assembly graph is a powerful approach for creating more complete assemblies. Our results hold across 360 'ideal' sequencing experiments (see following section) as well as when using an off-the-shelf assembler in the presence of base-call errors (assembly of 8 bacterial genomes using SOAPdenovo ). These results can serve as a basis for developing new algorithms and sequencing strategies that will improve de novo assemblies. Due to computational efficiency considerations we limited our analysis to prokaryotic genomes. Whether or not our results can be extended to eukaryotes (whose repeat structure is often quite different from prokaryotes) remains an exercise for future work.
We simulated ideal sequencing projects (no gaps, no errors, known mate-pair sizes) for 391 complete bacterial genomes using a range of read lengths and mate-pair sizes. Read lengths were chosen to be reflective of reads that can be affordably obtained using next-generation sequencing platforms. In addition, for 360 of these genomes, we performed a direct comparison of the amount of repeat resolution that can be obtained when applying information provided by long mate-pairs (libraries between several thousands to tens of thousands of basepairs are commonly generated in current sequencing experiments) vs. mate-pairs whose sizes are 'tuned' to the repeat structure of the genomes being assembled. Finally, we show that the results of our simulated comparison are recapitulated when assembling 8 bacterial genomes in the presence of base-call errors using an off -the-shelf assembler (SOAPdenovo).
We assume a refined version of the repeat graph for a genome; specifically, the graph that can be obtained through a series of lossless transformations on the original de Bruijn graph (see Methods). Since all of the transformations used simply convert unambiguous paths from the original graph into single nodes, the resulting simplified structure is equivalent (in terms of the spectrum of possible graph traversals) to the original graph. The simplified version of the graph is more compact and better highlights the complexity introduced by repeats, since every node in the simplified graph is either a repeat itself or is one of a set of possible nodes to traverse in between a pair of repeats. (For more information on this graph refinement process, see Methods and a more in depth description in an article by Kingsford et al. .)
Throughout most of our analysis, we consider k-mer size (the size of sequences used in initial construction of the de Bruijn graph) and read-length to be interchangeable, denoting both with the letter k. Although read-length and k-mer size are not typically the same in a real sequencing and assembly experiment (in reality longer reads are decomposed into shorter k-mers), we consider the distinction to be irrelevant throughout much of the discussion of our simulations; the only exception being the discussion of SOAPdenovo assemblies. Since our 'idealized' graphs (see Methods) are constructed from complete genomes decomposed directly into k-mers, we implicitly imagine that we have perfect reads of length k and that they can be used in their entirety during graph construction. Since a read of length k can provide a k-mer of length at most k, ignoring the distinction still provides a valid (if perhaps lenient) upper-bound on what one can expect to achieve if using reads of length k with a real assembler.
Our results begin with some theoretical analysis of the usefulness and limitations of mate-pairs and then proceed to an empirical investigation into the benefits of 'tuning' mate-pair libraries to target repeats of high complexity. It should be noted here that due to limitations of our current implementation to handle larger, eukaryotic de Bruijn graphs, we have restricted the scope of our results to cover only assembly of prokaryotic genomes. We intend to explore whether or not these results can be extended to eukaryotic assemblies in future work. A complete set of data files produced for this experiment is available at: http://www.cbcb.umd.edu/~wetzeljo/matePairs/.
Performance of shortest-path heuristic
More complete mate-pair statistics
As a corollary to the above observation, the value of mate-pairs is maximized only once sufficient coverage is achieved to ensure that virtually all resolvable repeats are adequately spanned. The necessary number of mate-pairs is a function of the genome size G and the k-mer length k used for graph construction (roughly 10G/k; see Methods for details).
'Localized complexity' is common in short-read assembly graphs
In order to estimate the extent of genomic complexity introduced by the bubble pattern described above, we define a measure of the localized complexity of a genome as follows. For each repeat node, v ∈ G, we define the node to be 'trivial' if all of its successor nodes have distinct lengths, and 'non-trivial' otherwise. The motivation for this nomenclature is that if v has multiple successors of the same length, it is likely that v fits the description of repeat R1 in the above stated example of an assembly bubble, and will be difficult to resolve without targeting at least one mate-pair to barely span its length. On the other hand, if all successors of v are of different lengths, v cannot fit the description of R1 in the above example, and the proper traversal order of v can likely be resolved using mate-pairs of arbitrary (longer) length. Thus we define the localized complexity of a genome, C-Statistic, to be the percentage of the finishing complexity of the genome contained in non-trivial nodes (C-Stat(G) = 100 , where N is the set of non-trivial nodes in the assembly graph, S is the set of all nodes in the assembly graph, and C v is the contribution of node v to the finishing complexity of the genome (see Methods for details on finishing complexity).
The existence of bubbles in assembly graphs has been noted before, and several approaches have been suggested for resolving such regions (see, e.g., Chaisson et al. ). Our results imply that any such methods will be of limited use unless the mate-pair libraries are tuned to match the repeat structure of the genome being assembled.
'Ideal' mate-pairs are short
We further explored the hypothesis that 'tuning' mate-pair libraries to accommodate a genome's specific repeat structure could lead to higher quality assemblies. All repeats in a genome were grouped according to size into bins of width 4k, where k is the read length. For each bin we calculated the fraction of the total finishing complexity (see Methods) of the genome that is due to the repeats from that bin. We then selected the two bins with the highest finishing complexities, and constructed mate-pair libraries that just span repeats in these bins by using inserts that were 3k longer than the average repeat size for each bin. We restrict our analysis to just two libraries for any given genome to match the setting commonly encountered in practice.
Mate-pairs selected in this fashion end up being roughly proportional in size to the read length, k, averaging from 4.5k to 6k. This is significantly shorter than commonly constructed long-range libraries (6,000 bp or longer).
These results are similar in spirit to, and complement, the EULER-PCR algorithm proposed by Mulyukov et al. . In EULER-PCR, the assembler uses information about the repeat length (along with the known sequences on either side of the repeat) in order to minimize the number of primers and multiplex PCR experiments needed to resolve tangles in the graph. Essentially, the assembler is leveraging the repeat structure of the graph to find optimal sets of short mate-pairs to reduce manual finishing. In the context of producing mate-pair libraries via sequencing, we cannot target individual repeats, so we instead attempt to find the mate-pair sizes that are most likely to resolve a large fraction of a genome's repeats.
'Tuned' mate-pair libraries perform well in a simulated setting
To evaluate the effectiveness of the 'tuned' mate-pair libraries just described, we analyzed through simulations 360 bacterial genomes (each marked by an * in Additional file 1) across 5 different read lengths. (Several of the 391 genomes we originally analyzed required prohibitive computational resources and these genomes were excluded from the current analysis.) For each genome we compared the performance of the 'tuned' mate-pair libraries to a mixture of 2,000 and 8,000 bp libraries. The latter mixture deserves a brief explanation: the majority of genome projects (irrespective of sequencing technology, see e.g. [23, 24] and jgi.doe.gov) rely on a mixture of library sizes, one of which is relatively long (possibly exceeding 10,000 bp and even as long as 40,000-50,000 bp in the case of fosmid libraries). As we cannot feasibly test all possible combinations of library sizes, we chose a combination that is reasonable and roughly matches the 'typical' scenario used in Sanger sequencing projects. In both cases (tuned libraries or long libraries) the depth of coverage was the same, and was chosen to ensure that all repeats can be adequately spanned (see Methods). For each genome we recorded the finishing complexity (see Methods) before and after the incorporation of mate-pair information, as well as the localized complexity (C-Statistic) of the genome.
For a combination of two long mate-pair libraries, one can see a clear inverse relationship between reduction in finishing complexity and C-Statistic (two graphs on right in Figure 4). However, this relationship is greatly diminished for the tuned libraries (two graphs on left in Figure 4). This result is consistent with the observation that tuned library sizes are longer for the low complexity genomes, where chains of trivial repeats can often be resolved simultaneously by a single mate-pair.
While the results of simulations on 250-mer graphs were very similar to the results of simulations on the shorter-read graphs (tuned mate-pairs outperformed long mate-pairs), the performance improvement over our 'standard' library mixture decreases for 500-mer graphs (see bottom two graphs in Figure 4). This can be attributed to the fact that the standard insert size of 2000 bp is short (exactly 4k) with respect to the original read length for the 500-mer graphs, while it is long (exactly 8k) with respect to the original read length of the 250-mer graphs. This result indicates that the need for tuning mate-pair lengths to the genome structure may be unique to the use of very short reads.
'Tuned' mate-pair libraries improve performance when using off-the-shelf assembly software
Genome-specific mate-pair lengths used in SOAPdenovo assemblies
Results of SOAPdenovo assemblies
There are a few difficulties we encounter when attempting to apply the exact quantitative analysis we describe above (and in Methods) to real assembly graphs. First, while our ideal graphs are unambiguously directed, real assembly graphs are inherently bi-directed since DNA is double-stranded. Therefore, it is not possible to directly transform graphs created from real data into unambiguously directed graphs so that the exact finishing complexity (by our definition) can be computed. Specifically, our quantitative analysis requires that we know the in-degree and out-degree of a particular node, yet the direction in which the edges are traversed in a real graph is often not known until the assembly process. Additionally, since edge multiplicities in real sequencing experiments must be estimated based on various metrics such as depth of coverage and errors occur during sequencing, real assembly graphs are often disconnected and certainly not fully Eulerian, further limiting the exact analysis described.
Nonetheless, the strategy for choosing the 'tuned' mate-pair sizes should be applicable to real genome projects. Note that our strategy for tuning the library sizes requires information about the amount of genomic complexity implied by a particular set of repeats of similar size, i.e. it is not sufficient to simply estimate the number and size of repeats from a genome assembly. Instead, the assembly graph (commonly output by many modern assemblers) needs to be analyzed in more detail. As we already mentioned, real assembly graphs have a different structure than the graphs used in our simulation. In order to accommodate the non-Eulerian nature of real assembly graphs, the methods we proposed can be converted to a Chinese Postman traversal setting, i.e., find a minimum-length tour of a non-Eulerian graph that covers each edge at least once. An efficient algorithm for solving this problem on bi-directed de Bruijn graphs has been recently published . The resulting Chinese Postman tour implies an underlying Eulerian graph (which is implicitly constructed while solving the traversal problem) within which the algorithms described in our work can be applied. Specifically, within this implied graph we can compute, for each repeat, the corresponding finishing complexity which, together with the repeat size (already known from the original assembly graph), can be used as outlined in the Results (subsection titled "Ideal mate-pairs are short") to heuristically determine appropriate mate-pair sizes. In future work we plan to build a software package that performs this analysis for commonly used genome assemblers.
Our results demonstrate that dramatic improvements in the quality of prokaryotic genome assemblies can be achieved by tuning mate-pair sizes to the actual repeat structure of a genome, suggesting the possible need to change the way that such sequencing projects are designed. In many sequencing projects, library sizes are chosen based on the ease with which certain size libraries can be effectively constructed in the lab. Virtually always, the mate-pair libraries are constructed at the beginning of the project, before having an opportunity to evaluate the repeat structure of the genome being sequenced. This 'one-step' approach was unavoidable in the case of Sanger sequencing where mate-pairs were a by-product of the sequencing process (sequencing would ultimately cost the same whether or not mate-pairs are generated).
In the case of next-generation sequencing technologies, however, the protocols were originally developed for creating unpaired reads, and the generation of mate-pairs generally increases the costs (in terms of prep time, reagents, and machine runtime) of sequencing. In this context, the two-tiered approach we propose - first generate an assembly of the genome with unpaired reads and evaluate the repeat structure of the genome; then generate the mate-pair libraries that provide most information towards the resolution of repeats in the genome being assembled - is not only possible, but likely more cost-effective in the long run. The tuned mate-pair libraries produced by this process will be more usable by the assembler, leading to better assemblies (longer correct contigs), which will dramatically reduce the cost of downstream manual finishing.
Across all of the genomes examined, as read lengths get shorter, the overall efficacy of mate-pairs in reducing finishing complexity is almost always diminished regardless of the amount of localized complexity present in the genome. This result is supported by the empirical observation that genome assemblies constructed from short-read data (e.g. ABI Solid or Illumina) are typically substantially more fragmented than those constructed with longer reads (454 or Sanger) , further underscoring the need to develop affordable sequencing technologies that generate long reads.
Many advances have been made in the field of automatic genome assembly in the past several years. Here we have shown that a novel, two-tiered approach to sequencing projects (along with the development of technologies which can reliably and affordably produce mate-pairs of a desired size) can greatly improve automation. However, even under ideal circumstances (idealized graphs, error-free reads, perfect coverage, and tuned mate-pair sizes), we still found ourselves unable to produce completely gapless assemblies in many cases. On average, approximately 17% of the original finishing complexity of a given genome still remained after applying mate-pairs. Thus, although the automation process continues to improve, the development of high-throughput and cost-effective approaches for genome finishing will also be of great importance in the years to come.
Idealized repeat graphs
A commonly used formulation of the assembly problem, based on de Bruijn graphs, was introduced by Pevzner et al. . Specifically, a de Bruijn graph of order k is a graph that contains a node for every (k - 1)-length string (referred to as a (k - 1)-mer) over a given alphabet, and that contains an edge for every pair of (k - 1)-mers that overlap by exactly (k - 2) letters (the two nodes and the edge thus represent one of all possible k-mers). In the context of assembly, we focus on the sub-graph of the de Bruijn graph that corresponds to the k-mers that are actually present in the genome being assembled (for simplicity we will refer to this sub-graph as the de Bruijn graph as well).
We constructed de Bruijn graphs from the complete genome sequences of 391 prokaryotes (obtained from ftp://ftp.ncbi.nlm.nih.gov; see Additional file 1 for genome identifiers) as follows. We first chose a k-mer size, k (values 35, 50, 100, 250, 500 bp were used to be representative of read-lengths achievable through a range of modern sequencing technologies). For each (k - 1)-mer present in the genome, a node was created (without repetition of nodes for repeated (k-1)-mers). For each k-mer present in the genome, a directed edge was created connecting the node representing its first (k - 1) letters to the node representing its last (k - 1) letters. If a k-mer was repeated in the genome, then a number of edges equal to the number of instances of the k-mer were created. Thus we created 'ideal graphs' that are representative of perfect sequencing experiments (exactly one read for every k-length substring in the genome and no sequencing errors). In a real sequencing experiment, due to the double-stranded nature of DNA, each read and its reverse complement must be considered, resulting in a bi-directed graph. For the sake of simplicity, our graphs were constructed using only reads in the forward direction.
A series of lossless simplifications were applied to the resulting de Bruijn graphs as described by Kingsford et al. . The first is a standard path compression in which if there exists a pair of nodes u and v, where u is the only predecessor of v, and v is the only successor of u, the nodes are combined into a single node with the same predecessors as u, and the same successors as v. The new node is labeled with the concatenated sequences of nodes u and v (accounting for the overlap between these sequences). A second type of compression is based on the idea that all areas of the de Bruijn graph whose cycle decomposition is a tree has a unique Eulerian tour, and can therefore be collapsed into a single node representing the sequence given by this traversal. The third technique iteratively 'unzips' half-decision nodes (those nodes with only one predecessor but multiple successors, or vice versa) by duplicating them and then performing standard path compression on the resulting unambiguous areas of the graph traversal.
We simulated mate-pairs along the original genome sequence using the following procedure. Given a library size l, we chose a mate-pair specific size d at random from a Gaussian distribution with mean l and standard deviation 0.1l. A starting point s along the genome was selected uniformly at random, and two reads of length k (same as the value used in constructing the graph) were selected starting from positions s and s + d - k. The reads were identified within the graph through a simple look-up, and the corresponding nodes were paired. In simulations involving mate-pairs of varying lengths, an equal number of mate-pairs of each length was used.
Throughout the analysis we assume to know the exact length d for each mate-pair, while in a practical setting only an approximate distance between mated-reads is known. Since there may be many paths of approximately the distance implied by a mate-pair, using our assumption can eliminate uncertainty as to whether a mate-pair implies a truly unambiguous path. Thus choosing not to model the uncertainty of mate-pair sizes allowed us to focus on understanding the effect of insert size on repeat resolution, rather than becoming stymied by ambiguity or resorting to heuristics aimed at choosing the 'most-likely' of potentially many paths implied by the mate-pair. In practice, this approach also allows us to use a great many mate-pairs in our simulation that might not have been usable by a real assembler. Although we do not model uncertainty about mate-pair sizes, we do take into account the variability in mate-pair sizes within one library that occurs during real mate-pair creation by choosing the mate-pair size from a Gaussian distribution (as described above).
Repeat resolution through mate-pairs
We map each end of a mate-pair to a specific node in the graph. Because the graphs and mate-pairs are constructed from the same known sequence, the coordinates of each node in the assembly graph with respect to the reference genome is known, and this information is used to determine the placement of the mated reads within the graph. In a practical setting, this placement of mated reads can be determined by finding all good alignments (or some subset of the best of these alignments) of the mated sequences to nodes in the graph, then choosing a placement that most closely reflects the length of the insert based on some path-finding heuristic.
We check the graph for a shortest path that both connects the endpoints of a mate-pair and is consistent with the exact insert length. If a mate-pair cannot be mapped to a shortest path in the graph it is eliminated from further consideration. Mate-pairs whose ends map to a same node were excluded from further analysis as they can clearly provide no additional connectivity information. Similarly, we excluded from further analysis those mate-pairs whose ends mapped to two adjacent nodes in the graph. Since such mate-pairs do not span repeats, we see no clear way to use them for repeat resolution. Additionally, if more than one shortest path connects the ends of a mate-pair, there is ambiguity as to which path is the correct one, so the mate-pair is not used. Essentially, only mate-pairs connected by a unique shortest path that is exactly consistent with the known insert length are used for repeat resolution.
Mate-pairs fitting the above criteria are considered useful for disambiguating repeats. Specifically, these mate-pairs indicate that the Eulerian tour through the graph that is consistent with the correct reconstruction of the genome must contain the shortest path connecting their endpoints, thereby eliminating the uncertainty introduced by decision nodes (forks) along these paths.
More sophisticated heuristics for using mate-pair information could be applied here (several have been previously proposed as outlined in the Background section). We rely on a shortest path heuristic in order to ensure computational tractability since, in the general case, finding a path of a given length within a graph is NP-hard . While this heuristic may fail in certain circumstances (i.e. a unique path, consistent with the mate-pair length, exists between the two ends yet it is longer than the shortest path), our results show that, in practice, the vast majority of mate-pairs have lengths consistent with a shortest path between the mated sequences (see Results). Thus more sophisticated heuristics should only have a limited impact. In addition, similar shortest path approaches have been previously used in genome assembly (see, e.g. Medvedev et al. ).
Finishing complexity of the graph
In order to quantitatively characterize the benefit of mate-pair information for repeat resolution, we define a new measure of assembly complexity. Our measure, which we call 'finishing complexity', estimates the number of manual fishing experiments that would be required in order to correctly resolve all repeats and reconstruct a complete and correct version of a genome given the current state of the assembly graph. By using such a measure we assume that the primary goal of the assembly process is to automatically reconstruct as complete a picture of the genome sequence as possible. This measure may not directly apply to other contexts; e.g., in a project that only attempts to correctly reconstruct the genes, rather than the full sequence, of an organism.
A straightforward strategy for manually resolving a repeat is to choose an arbitrary in-edge of the repeat and try pairing it with each possible out-edge using targeted experiments (e.g. through targeted PCR) until the correct pairing is found, at which point both the in-degree and out-degree of the repeat are effectively reduced by 1. This process is then repeated for the remaining edges. The last pairing is implicit, as only one in-edge and one out-edge remain. Thus we define a node's finishing complexity to be , where a can be either the in-degree or out-degree of the node. Since our graphs are Eulerian, the in-degree and out-degree of any particular node are always equal. The total finishing complexity of the graph is simply the sum of the finishing complexities of all nodes.
Determining the optimal number of mate-pairs
Our results on 'localized complexity' (see Results) indicate that it is frequently necessary to target mate-pairs to tightly span repeats of a particular size if we wish to produce mate-pairs that can imply truly unambiguous paths. However, simply picking an insert size that can potentially span repeats of this size is not sufficient to ensure that (when mate-pairs are randomly generated) all such repeats will be disambiguated by the resulting library. It is also necessary to generate sufficient coverage such that each instance of every repeat of the targeted size is actually spanned by at least one mate-pair. This is especially important if very short mate-pairs are being used.
Thus if we create 10(G/k) mate-pairs, we virtually guarantee at least one mate-pair arrival across any stretch of k nucleotides in the genome. Since no nodes in the graph are shorter than k, there should be at least one mate-pair starting in each node that directly leads into a repeat (with at least one additional mate-pair for each additional traversal of the node), and the determination of 'ideal' insert size (see Results) should guarantee that the other end of the mate-pair is just on the other side of the repeat (for the repeats of the size being targeted).
For each of 8 bacterial genomes to be assembled (see Table 3), 30× coverage in 100 bp reads were created using MetaSim v. 0.9.1 with an empirical error model derived from the 80 bp model provided by MetaSim (assuming the last 21 bp have the same error characteristics). Several sets of mate-pairs were constructed for the various assemblies, chosen to be reflective of either standard libraries, 'ideal' libraries (see Results) for a 35-mer graph, or ideal libraries for a 100-mer graph. The mate-pair lengths were distributed normally around the mean with a standard deviation of 10%. SOAPdenovo v. 1.04 was run with option -K31 (the largest k-mer size allowed by SOAPdenovo) and configuration options reverse_seq = 0 (standard mate-pair orientation), asm_flags = 3 (try harder to build large contigs), pair_num_cutoff = 2 (minimum number of mate-pairs required to link contigs), map len = 60 (minimum length match between a read and a contig during scaffolding).
Funding This work was partially supported by NSF grant IIS-0812111 to MP and CK, NSF grant 0849899 to CK, and NSF Award CCF-0830569 to Rajiv Gandhi.
- Alkan C, Sajjadian S, Eichler EE: Limitations of next-generation genome sequence assembly. Nat Meth 2011, 8: 61–65. 10.1038/nmeth.1527View ArticleGoogle Scholar
- Kingsford C, Schatz M, Pop M: Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 2010, 11: 21. 10.1186/1471-2105-11-21PubMed CentralView ArticlePubMedGoogle Scholar
- Pevzner P, Tang H: Fragment assembly with double-barreled data. Bioinformatics 2001, 17(suppl 1):S225–233.View ArticlePubMedGoogle Scholar
- Pevzner P, Tang H, Waterman M: An Eulerian Path Approach to DNA Fragment Assembly. Proceedings of the National Academy of Sciences of the United States of America 2001, 98(17):9748–9753. 10.1073/pnas.171285098PubMed CentralView ArticlePubMedGoogle Scholar
- Fleischmann R, Adams M, White O, Clayton R, Kirkness E, Kerlavage A, Bult C, Tomb J, Dougherty B, Merrick J, McKenney K, Sutton G, Fitzhugh W, Fields C, Gocyne J, Scott J, Shirley R, Liu L, Glodek A, Kelley J, Jenny M, Weidman J, Phillips C, Spriggs T, Hedblom E, Cotton M, Utterback T, Hanna M, Nguyen D, Saudek D, Brandon R, Fine L, Fritchman J, Fuhrmann J, Geoghagen N, Gnehm C, McDonald L, Small K, Fraser C, Smith H, Venter J: Whole-genome Random Sequencing and Assembly of Haemophilus influenzae Rd. Science 1995, 269(5223):496–512. 10.1126/science.7542800View ArticlePubMedGoogle Scholar
- Myers E, Sutton G, Delcher A, Dew I, Fasulo D, Flanigan M, Kravitz S, Mobarry C, Reinert K, Remington K, Anson E, Bolanos R, Chou H, Jordan C, Halpern A, Lonardi S, Beasley E, Brandon R, Chen L, Dunn P, Lai Z, Liang Y, Nusskern D, Zhan M, Zhang Q, Zheng X, Rubin G, Adams M, Venter J: A Whole Genome Assembly of Drosophila . Science 2000, 287(5461):2196–2204. 10.1126/science.287.5461.2196View ArticlePubMedGoogle Scholar
- Zerbino D, Birney E: Velvet: Algorithms for de Novo short read assembly using de Bruijn graphs. Genome Research 2008, 18(5):821–829. 10.1101/gr.074492.107PubMed CentralView ArticlePubMedGoogle Scholar
- Batzoglou S, Jaffe D, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov J, Lander E: ARACHNE: a whole genome shotgun assembler. Genome Research 2002, 12: 177–189. 10.1101/gr.208902PubMed CentralView ArticlePubMedGoogle Scholar
- Butler J, MacCallum I, Kleber M, Belmonte ISM, Lander E, Nusbaum C, Jaffe D: ALLPATHS: De Novo assembly of whole-genome shotgun microreads. Genome Research 2008, 18(5):810–820. 10.1101/gr.7337908PubMed CentralView ArticlePubMedGoogle Scholar
- Pop M, Kosack D, Salzberg S: Hierarchical scaffolding with Bambus. Genome Research 2004, 14: 149–159. 10.1101/gr.1536204PubMed CentralView ArticlePubMedGoogle Scholar
- Pop M: Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics 2009, 10(4):354–366. 10.1093/bib/bbp026PubMed CentralView ArticlePubMedGoogle Scholar
- Bartels D, Kespohl S, Albaum S, Drüke T, Goesmann A, Herold J, Kaiser O, Püler A, Pfeiffer F, Raddatz G, Stoye J, Meyer F, Schuster S: BACCardi-a tool for the validation of genomic assemblies, assisting genome finishing and intergenome comparison. Bioinformatics 2005, 21: 853–859. 10.1093/bioinformatics/bti091View ArticlePubMedGoogle Scholar
- Gordon D, Abaijian C, Green P: Consed: A graphical tool for sequence finishing. Genome Research 1998, 8: 195–202.View ArticlePubMedGoogle Scholar
- Schatz M, Phillipy A, Schneiderman B, Salzberg S: Hawkeye: an interactive visual analytics tool for genome assemblies. Genome Biology 2007, 8: R34. 10.1186/gb-2007-8-3-r34PubMed CentralView ArticlePubMedGoogle Scholar
- Nykänen M, Ukkonen E: The exact path length problem. J Algorithms 2002, 42: 41–53.View ArticleGoogle Scholar
- Chaisson M, Brinza D, Pevzner P: De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Research 2009, 19: 336–346. 10.1101/gr.079053.108PubMed CentralView ArticlePubMedGoogle Scholar
- Zerbino D, McEwen G, Margulies E, Birney E: Pebble and Rock Band: Heuristic Resolution of Repeats and Scaffolding in the Velvet Short-Read de Novo Assembler. PLoS one 2009, 4(12):e8407. 10.1371/journal.pone.0008407PubMed CentralView ArticlePubMedGoogle Scholar
- Chikhi R, Lavenier D: Paired-end read length lower bounds for genome re-sequencing. BMC Bioinformatics 2009, 10(Suppl 13):O2. 10.1186/1471-2105-10-S13-O2PubMed CentralView ArticleGoogle Scholar
- Bashir A, Bansal V, Bafna V: Designing deep sequencing experiments: structural variation, haplotype assembly, and transcript abundance. BMC Genomics 2010, 11: 385. 10.1186/1471-2164-11-385PubMed CentralView ArticlePubMedGoogle Scholar
- Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J: De novo assembly of human genomes with massively parallel short read sequencing. Genome Research 2010, 20(2):265–272. 10.1101/gr.097261.109PubMed CentralView ArticlePubMedGoogle Scholar
- Medvedev P, Brudno M: Maximum likelihood genome assembly. Journal of Computational Biology 2009, 16(8):1101–1116. 10.1089/cmb.2009.0047PubMed CentralView ArticlePubMedGoogle Scholar
- Mulyukov Z, Pevzner P: EULER-PCR: Finishing Experiments for Repeat Resolution. Pacific Symposium on Biocomputing 2002, 199–210.Google Scholar
- Göker M, Held B, Lucas S, Nolan M, Yasawong M, Rio TD, Tice H, Cheng J, Bruce D, Detter J, Tapia R, Han C, Goodwin L, Pitluck S, Liolios K, Ivanova N, Mavromatis K, Mikhailova N, Pati A, Chen A, Palaniappan K, Land M, Hauser L, Chang Y, Jeffries C, Rohde M, Sikorski J, Pukall R, Woyke T, Bristow J, Eisen J, Markowitz V, Hugenholtz P, Kyrpides N, Klenk H, Lapidus A: Complete genome sequence of Olsenella uli type strain (VPI D76D-27CT). Standards in Genomic Sciences 2010., 3:Google Scholar
- Wirth R, Sikorski J, Brambilla E, Misra M, Lapidus A, Copeland A, Nolan M, Lucas S, Chen F, Tice H, Cheng J, Han C, Detter J, Tapia R, Bruce D, Goodwin L, Pitluck S, Pati A, Anderson I, Ivanova N, Mavromatis K, Mikhailova N, Chen A, Palaniappan K, Bilek Y, Hader T, Land M, Hauser L, Chang Y, Jeffries C, Tindall B, Rohde M, Göker M, Bristow J, Eisen J, Markowitz V, Hugenholtz P, Kyrpides N, Klenk H: Complete genome sequence of Thermocrinis albus type strain (HI 11/12T). Standards in Genomic Sciences 2010, 2(2):194. 10.4056/sigs.761490PubMed CentralView ArticlePubMedGoogle Scholar
- Kundeti V, Rajasekaran S, Dinh H: An Efficient Algorithm For Chinese Postman Walk on Bi-directed de Bruijn Graphs. CoRR 2010, abs/1006.4828.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.