Next-generation sequencing platforms such as Illumina, ABI Solid, and 454 allow genomes to be sequenced more quickly and at a lower cost than ever before. However, as sequencing technology has improved, the wealth of available data has not made genome assembly easier. In fact, the level of difficulty has increased, as the cost savings provided by new technologies are accompanied by a reduction in achievable read-length. Assembly software is now faced with the task of assembling reads ranging from approximately 35 to 500 nucleotides, in contrast to the 800-2000 nucleotide reads previously generated by the traditional Sanger technology. Additionally, the error characteristics of the newer technologies are not as well known as they were for traditional Sanger sequencing. De novo assemblies (even in the case of prokaryotic genomes) are highly fragmented . In addition, advances in sequencing technologies have not been mirrored by corresponding improvements in finishing - a time-, labor-, and cost-intensive process aimed at reconstructing a complete, gapless, genome sequence from a fragmented assembly. As a result, the majority of genomes remain in an incomplete 'draft' state, hampering studies that rely on long-range genome structure information (e.g., analysis of operon/regulon structure, or large-scale genomic variation).
Genomic repeats - DNA segments repeated in nearly-identical form throughout a genome - are the main reason why genome assemblers cannot automatically reconstruct complete genome sequences from modern sequencing data. Repeats have a number of biological causes, including transposable sequence elements (which can often be highly abundant in a genome), prophages, highly-conserved gene clusters (e.g., the Ribosomal RNA operons in bacteria), or large segmental duplications, and can vary from short simple sequence repeats (e.g., AAAAA, ACACAC) to long stretches of DNA (thousands to tens of thousands of base-pairs) that are highly or completely identical. In the context of inexpensive next-generation sequencing and assembly, some of the original pitfalls of genome assembly, such as coverage gaps and confusion caused by read errors, can often be obviated by simply sequencing the genome to very high coverage levels. Repeated sequences, on the other hand, cannot be avoided by simply over-sequencing, and lead to considerable ambiguity in the reconstruction of a genome, providing limits on the length of the contiguous DNA segments (contigs) that can be correctly reconstructed using unpaired reads .
Graph-theoretic models of genome assembly provide an effective framework for analyzing the impact of repeats on the complexity of genome assembly . The genome assembly problem is commonly formulated as finding a constrained path through an appropriately-defined graph. Throughout this article we will rely on an Eulerian formulation  that reduces the assembly problem to finding a Chinese Postman path/tour (a minimum length path through the graph that covers all edges) within a de Bruijn graph (to be defined later). Under this formulation, repeats appear as forks in the graph that make it difficult to select the graph traversal that corresponds to the correct reconstruction of the genome from among an exponential (in the number of repeats) number of possible Chinese Post-man paths. Without additional information, genome assemblers can only correctly reconstruct relatively short, unordered segments of the genome (corresponding to those sections of the graph which contain no forks).
In addition to the collection of reads, most sequencing technologies also produce pairwise constraints on the placement (approximate distance and relative orientation) of these reads along the genome - mate-pair information. This information can further constrain the possible traversals of the assembly graph, thereby allowing longer segments of the genome to be unambiguously reconstructed. Mate-pair information has been a critical component of most genome projects, starting with Haemophilus influenzae  - the first free-living organism to be fully sequenced. Most genome assemblers now include modules that can use mate-pair information for scaffolding and repeat resolution (Celera Assembler , Velvet , Euler , Arachne , and ALLPATHS  to name just a few), and stand-alone scaffolding tools such as Bambus  allow the incorporation of mate-pair data into virtually any assembly . Furthermore, mate-pairs are frequently used for the purpose of assembly validation in tools such as BACCardi , Consed , and Hawkeye .
Most of the research on the use of mate-pair information in genome projects has focused primarily on the use of this information to achieve long-range connectivity by spanning long repeats and gaps in the assembly due to insufficient sequencing coverage. As a result, substantial efforts have been focused on the development of robust protocols for constructing long-range (8-10 kbp or longer) mate-pair libraries. Here, we explore a complementary purpose for mate-pairs - the automatic resolution of repeats during assembly. We argue that, due to affordable high-throughput sequencing technologies, coverage gaps are far less frequent than they used to be, and assembly fragmentation is primarily caused by repeats. Thus improvements in repeat resolution algorithms will translate into substantial improvements in the quality of the resulting assemblies.
Resolving repeats with mate-pairs
All previously published techniques for repeat resolution rely on the same basic observation: If a unique path in the assembly graph can be found that connects the sequences at the ends of a mate-pair and the length of this path matches the approximately known mate-pair size, then this path can be inferred to be correct; i.e., the path represents a partial traversal of the graph that is consistent with the correct reconstruction of the genome being assembled. Since the problem of finding a path of predefined length through a graph is NP-hard , the algorithms used during assembly rely on various heuristics for efficiently finding paths that support mate-pair information.
The EULER assembler  greedily finds paths whose lengths are consistent with the size of mate-pairs (the authors indicate that such paths can be easily found for a majority of mate-pairs), then converts these paths into artificial long reads that can be processed through the Eulerian superpath algorithm . Conflicts between paths through the graph are resolved by prioritizing paths on the basis of their support (number of mate-pairs that confirm a given path) . The Velvet [7, 17] and ALLPATHS  assemblers take into account the uniqueness of nodes in the assembly graph. Specifically, mate-pair links are considered only if they are anchored in contigs that correspond to non-repeated sequences in the genome (usually determined through depth of coverage statistics).
Analyses of the effectiveness of such algorithms have typically assumed the parameters of the sequencing experiment to be fixed; i.e., the goal is to build the best assembly possible given the types of data commonly generated in current sequencing projects. An exception is the study by Chaisson et al.  where the authors evaluate the effect of read length and read quality on the ability to reconstruct a genome. In other words, they assume one can tune the read length generated by a sequencing machine (which is a realistic assumption for the Illumina technology) and estimate whether the assembly improves, and by how much, if the length of reads is increased. Such analyses are critical for providing a scientific basis for picking the optimal trade-off between sequencing cost (which increases with the read length) and quality of assembly.
In our work, we pose a complementary question: How useful are mate-pairs for resolving repeats in de novo assemblies created from short-reads? Which types of mate-pair libraries most effectively resolve repeats and minimize the amount of manual finishing needed to complete the genome (given that there is some restriction on the number of mate-pair libraries that can be cost-effectively produced in a real sequencing experiment)? Similar questions have been addressed by recent publications in the context of sequence alignment. Chikhi et al.  evaluate how read length affects genome resequencing experiments, and Bashir et al.  developed a method to identify the mate-pair sizes that are optimal for detecting structural variation through mapping. However, the question of how effective mate-pairs are for the resolution of repeats and how this parameter of the sequencing experiment can be adjusted to improve assemblies has not previously been analyzed in a systematic fashion.
Measuring assembler performance
Metrics commonly used for comparing the quality of genome assemblies are primarily focused on statistics derived from the global distribution of contig sizes. Statistics such as the number of contigs, average contig size, and N50 contig size are frequently reported in the literature. (N50 contig size is the size c for which 50% of the bases in the genome are contained in contigs of size at least c.)
When the correct answer is known (e.g., reassembly of a known genome), one can also record the number of errors found in the assembly. An alternative approach is proposed by Chaisson et al.  where they compare the results of an assembly to a theoretical optimal: the best assembly that can be reconstructed without errors from a genome, given the repeat graph (see Methods) of that genome.
In our study we are specifically targeting this theoretical optimum: in an idealized setting (perfect sequencing data), what is the best possible assembly that can be obtained given the parameters of the sequencing process? While aiming for this theoretical optimum, we also maintain a dose of reality about certain aspects of sequencing projects, restricting ourselves to the use of only two mate-pair libraries (a fairly standard procedure due to costs) and use of read-lengths that are reflective of inexpensive next-generation sequencing. Like Chaisson et al.  and Kingsford et al. , we start with the idealized repeat graph of a genome (see Methods), the structure of which is determined by the read length. Although we are attempting to maximize the size of contigs that can be unambiguously reconstructed from this graph, contig size statistics are difficult to compare across genomes and may not adequately describe the amount of repeat resolution that a set of mate-pairs provide. Therefore, we focus instead on a measure of assembly ambiguity that is directly related to unresolved repeats. Specifically, we measure the number of manual experiments that would be necessary to completely resolve the structure of a genome during finishing. Briefly, a path through the repeat graph can be uniquely determined by pairing up the edges adjacent to repeat nodes. This pairing is commonly determined during finishing through targeted PCR experiments.
We measure the usefulness of a mate-pair library in terms of the amount of manual finishing effort that can be saved through its use (see Methods for details). We show that, in the context of high-depth prokaryotic sequencing experiments, very short mate-pairs are more useful than long mate-pairs (sizes as high as 40,000 bp are often generated in sequencing projects) for resolving repeats, and that choosing mate-pair sizes based on the repeat structure (defined in Methods) of the assembly graph is a powerful approach for creating more complete assemblies. Our results hold across 360 'ideal' sequencing experiments (see following section) as well as when using an off-the-shelf assembler in the presence of base-call errors (assembly of 8 bacterial genomes using SOAPdenovo ). These results can serve as a basis for developing new algorithms and sequencing strategies that will improve de novo assemblies. Due to computational efficiency considerations we limited our analysis to prokaryotic genomes. Whether or not our results can be extended to eukaryotes (whose repeat structure is often quite different from prokaryotes) remains an exercise for future work.