Near-optimal assembly for shotgun sequencing with noisy reads

Recent work identified the fundamental limits on the information requirements in terms of read length and coverage depth required for successful de novo genome reconstruction from shotgun sequencing data, based on the idealistic assumption of no errors in the reads (noiseless reads). In this work, we show that even when there is noise in the reads, one can successfully reconstruct with information requirements close to the noiseless fundamental limit. A new assembly algorithm, X-phased Multibridging, is designed based on a probabilistic model of the genome. It is shown through analysis to perform well on the model, and through simulations to perform well on real genomes.


Background
Optimality in the acquisition and processing of DNA sequence data represents a serious technology challenge from various perspectives including sample preparation, instrumentation and algorithm development.Despite scientific achievements such as the sequencing of the human genome and ambitious plans for the future [17,14], there is no single, overarching framework to identify the fundamental limits in terms of information requirements required for successful output of the genome from the sequence data.
Information theory has been successful in providing the foundation for such a framework in digital communication [15], and we believe that it can also provide insights into understanding the essential aspects of DNA sequencing.A first step in this direction has been taken in the recent work [1], where the fundamental limits on the minimum read length and coverage depth required for successful assembly are identified in terms of the statistics of various repeat patterns in the genome.Successful assembly is defined as the reconstruction of the underlying genome, i.e. genome finishing [13].The genome finishing problem is particularly attractive for analysis because it is clearly and unambiguously defined and is arguably the ultimate goal in assembly.There is also a scientific need for finished genomes [10] [9].Until recently, automated genome finishing was beyond reach [4] in all but the simplest of genomes.New advances using ultra-long read single-molecule sequencing, however, have reported successful automated finishing [2,6].Even in the case where finished assembly is not possible, the results in [1] provide insights on optimal use of read information since the heart of the problem lies in how one can optimally use the read information to resolve repeats.Figure 1a gives an example result for the repeat statistics of E. coli K12.The x-axis of the plot is the read length and the y-axis is the coverage depth normalized by the Lander-Waterman depth (number of reads needed to cover the genome [8]).The lower bound identifies the necessary read length and coverage depth required for any assembly algorithm to be successful with these repeat statistics.An assembly algorithm called Multibridging Algorithm was presented, whose read length and coverage depth requirements are very close to the lower bound, thus tightly characterizing the fundamental information requirements.The result shows a critical phenomenon at a certain read length L = crit : below this critical read length, reconstruction is impossible no matter how high the coverage depth; slightly above this read length, reconstruction is possible with Lander-Waterman coverage depth.This critical read length is given by crit = max{ int , tri }, where int is the length of the longest pair of exact interleaved repeats and tri is the length of the longest exact triple repeat in the genome, and has its roots in earlier work by Ukkonen on Sequencing-by-Hybridization [18].The framework also allows the analysis of specific algorithms and the comparison with the fundamental limit; the plot shows for example the performance of the Greedy Algorithm and we see that its information requirement is far from the fundamental limit.
A key simplifying assumption in [1] is that there are no errors in the reads (noiseless reads).However reads are noisy in all present-day sequencing technologies, ranging from primarily substitution errors in Illumina R platforms, to primarily insertion-deletion errors in Ion Torrent R and PacBio R platforms.The following question is the focus of the current paper: in the presence of read noise, can we still successfully assemble with a read length and coverage depth close to the minimum in the noiseless case?A recent work [5] with an existing assembler suggests that the information requirement for genome finishing substantially exceeds the noiseless limit.However, it is not obvious whether the limitations lie in the fundamental effect of read noise or in the sub-optimality of the algorithms in the assembly pipeline.

Results
The difficulty of the assembly problem depends crucially on the genome repeat statistics.Our approach to answering the question of the fundamental effect of read noise is based on design and analysis using a parametric probabilistic model of the genome that matches the key features of the repeat statistics we observe in genomes.In particular, it models the presence of long flanked repeats which are repeats flanked by statistically uncorrelated region.Figure 1b shows a plot of the predicted information requirement for reliable reconstruction by various algorithms under a substitution error rate of 1%.The plot is based on analytical formulas derived under our genome model with parameters set to match the statistics of E. coli K12.We show that it is possible in many cases to develop algorithms that approach the noiseless lower bound even when the reads are noisy.Specifically, the X-phased Multibridging Algorithm has close to the same critical read length L = crit as in the noiseless case and only slightly greater coverage depth requirement for read lengths greater than the critical read length.
We then proceed to build a prototype assembler based on the analytical insights and we perform experiments on real genomes.As shown in Figure 2, we test the prototype assembler by using it to assemble noisy reads sampled from 4 different genomes.At coverage and read length indicated by a green circle, we successfully  assemble noisy reads into one contig (in most cases with more than 99% of the content matched when compared with the ground truth).Note that the information requirement is close to the noiseless lower bound.Moreover, the algorithm (X-phased Multibridging) is computationally efficient with the most computational expensive step being the computation of overlap of reads/K-mers, which is an unavoidable procedure in most assembly algorithms.
The main conclusion of this work is that, with an appropriately designed assembly algorithm, the information requirement for genome assembly is surprisingly insensitive to read noise.The basic reason is that the redundancy required by the Lander-Waterman coverage constraint can be used to denoise the data.This is consistent with the asymptotic result obtained in [11] and the practical approach taken in [2].However, the result in [11] is based on a very simplistic i.i.d.random genome model, while the model and genomes considered in the present paper both have long repeats.A natural extension of the Multibridging Algorithm in [1] to handle noisy reads allows the resolution of these long flanked repeats if the reads are long enough to span them, thus allowing reconstruction provided that the read length  is greater than L = ˜ crit = max{ ˜ int , ˜ tri }, where ˜ int is the length of the longest pair of flanked interleaved repeats and ˜ tri is the length of the longest flanked triple repeat in the genome.This condition is shown as a vertical asymptote of the "Multibridging Algorithm" curve in Figure 1b.By exploiting the redundancy in the read coverage to resolve read errors, the X-phased Multibridging can phase the polymorphism across the flanked repeat copies using only reads that span the exact repeats.Hence, reconstruction is achievable with a read length close to L = crit , which is the noiseless limit.

Related work
All assemblers must somehow address the problem of resolving noise in the reads during genome reconstruction.However, the traditional approaches to measuring assembly performance makes quantitative comparisons challenging for unfinished genomes [12].In most cases, the heart of the assembly problem lies in processing of the assembly graph, as in [19,3,16].A common strategy for dealing with ambiguity from the reads lies in filtering the massively parallel sequencing data using the graph structure prior to traversing possible assembly solutions.In the present work, however, we are focused on the often-overlooked goal of optimal data efficiency.Thus, to the extent possible we distinguish between the read error and the mapping ambiguity associated with the shotgun sampling process.The proposed assembler, X-phased Multibridging, adds information to the assembly graph based on a novel analysis of the underlying reads.

Methods
The path towards developing X-phased Multibridging is outlined as follows.
1 Setting up the shotgun sequencing model and problem formulation.
2 Analyzing repeats structure of genome and their relationship to the information requirement for genome finishing.3 Developing a parametric probabilistic model that captures the long tail of the repeat statistics.4 Deriving and analyzing an algorithm that require minimal information requirements for assembly -close to the noiseless lower bound.5 Performing simulation-based experiments on real and synthetic genomes to characterize the performance of a prototype assembler for genome finishing.6 Extending the algorithm to address the problem of indel noise.

Sequencing model
Let s be a length G target genome being sequenced with each base in the alphabet set Σ = {A, C, G, T }.In the shotgun sequencing process, the sequencing instrument samples N reads, r 1 , . . ., r N of length L and sampled uniformly and independently from s.This unbiased sampling assumption is made for simplicity and is also supported by the characteristics of single-molecule (e.g.PacBio R ) data.Each read is a noisy version of the corresponding length L substring on the genome.The noise may consist of base insertions, substitutions or deletions.Our analysis focus on substitution noise first.In a later section, indel noise is addressed.In the substitution noise model, let p be the probability that a base is substituted by another base, with probability p/3 to be any other base.The errors are assumed to be independent across bases and across reads.

Formulation
Successful reconstruction by an algorithm is defined by the requirement that, with probability at least 1 − , the reconstruction ŝ is a single contig which is within edit distance δ from the target genome s.If an algorithm can achieve that guarantee at some (N, L), it is called -feasible at (N, L).This formulation implies automated genome finishing, because the output of the algorithm is one single contig.The fundamental limit for the assembly problem is the set of (N, L) for which successful reconstruction is possible by some algorithms.If ŝ is directly spelled out from a correct placement of the reads, the edit distance between ŝ and s is of the order of pG, where the error rate is p.This motivates fixing δ = 2pG for concreteness.The quality of the assembly can be further improved if we follow the assembly algorithm with a consensus stage in which we correct each base, e.g. with majority voting.But the consensus stage is not the focus in this paper.

Repeats structure and their relationship to the information requirement for successful reconstruction
Long exact repeats and their relationship to assembly with noiseless reads We take a moment to carefully define the various types of exact repeats.Let s t denote the length-substring of the DNA sequence s starting at position t.An exact repeat of length is a substring appearing twice, at some positions t 1 , t 2 (so s t1 = s t2 ) that is maximal (i.e.s(t 1 − 1) = s(t 2 − 1) and s(t 1 + ) = s(t 2 + )).
A copy of a repeat is a single one of the instances of the substring appearances.A pair of exact repeats refers to two exact repeats, each having two copies.A pair of exact repeats, one at positions t 1 , t 3 with t 1 < t 3 and the second at positions t 2 , t 4 with t 2 < t 4 , is interleaved if t 1 < t 2 < t 3 < t 4 or t 2 < t 1 < t 4 < t 3 .The length of a pair of exact interleaved repeats is defined to be the length of the shorter of the two exact repeats.A typical appearance of a pair of exact interleaved repeat is -X-Y-X-Y-where X and Y represent two different exact repeat copies and the dashes represent non-identical sequence content.
We let max be the length of the longest exact repeat, int be the length of the longest pair of exact interleaved repeats and tri be the length of the longest exact triple repeat.
As mentioned in the introduction, it was observed that the read length and coverage depth required for successful reconstruction using noiseless reads for many genomes is governed by long exact repeats.For some algorithms (e.g.Greedy Algorithm), the read length requirement is bottlenecked by max .The Multibridging Algorithm in [1] can successfully reconstruct the genome with a minimum amount of information.The corresponding minimum read length requirement is the critical exact repeat length crit = max( int , tri ).

Flanked repeats
While exact repeats are defined as the segments terminated on each end by a single differing base (Fig 3a), flanked repeats are defined by the segments terminated on each end by a statistically uncorrelated region.We call that ending region to be the random flanking region.A distinguishing characteristic of the random flanking region is a high Hamming distance to segment length ratio between the ends of two repeat copies.The ratio in the random flanking region is around 0.75, which matches with that when the genomic content is independently and uniformly randomly generated.We observe that long repeats of many genomes terminate with random flanking region.Additional statistical analysis is detailed in the Appendix.
If the repeat interior is exactly the same between two copies of the flanked repeat (Fig 3b), the corresponding flanked repeat is called a flanked exact repeat.If there are a few edits (called polymorphism) within the repeat interior (Fig 3c ), the corresponding flanked repeat is called a flanked approximate repeat.
The length of the repeat interior bounded by the random flanking region is then the flanked repeat length.We let ˜ max be the length of the longest flanked repeat, ˜ int be the length of the longest pair of flanked interleaved repeats and ˜ tri be the length of the longest flanked triple repeat.The critical flanked repeat length is then Long flanked exact repeats and their relationship to assembly with noisy reads If all long flanked repeats are flanked exact repeats, we can utilize the information in the random flanking region to generalize Greedy Algorithm and Multibridging Algorithm to handle noisy reads.The corresponding information requirement is very similar to that when we are dealing with noiseless reads.
The key intuition is as follows.A criterion for successful reconstruction is the existence of reads to span the repeats to their neighborhood.When a read is noiseless, it only need to be long enough to span the repeat interior to its neighborhood by one base (Fig 4a) so as to differentiate between two exact repeat copies.When a read is noisy, it then need to be long enough to span the repeat interior plus a short extension into the random flanking region (Fig 4b) so as to confidently differentiate between two flanked repeat copies.However, the Hamming distance between two flanked repeat copies' neighborhood in the random flanking region is very high even within a short length.This can be used to differentiate between two flanked repeat copies confidently even when the reads are noisy.The short extension into the random flanking region has a length which is typically of order of tens whereas the long repeat length is of order of thousands.Therefore, relative to the repeat length, the change of the critical read length requirement from handling noiseless reads to noisy reads is only marginal when all long repeats are flanked exact repeats.
Long flanked approximate repeats and their relationship to assembly with noisy reads If a long flanked repeat is a flanked approximate repeat, the flanked repeat length may be significantly longer than the length of its longest enclosed exact repeat.Merely relying on the information provided by the random flanking region requires the reads to be of length longer than the flanked repeat length for successful reconstruction.This explains why the information requirement for Greedy Algorithm and Multibridging Algorithm has a significant increase when we use noisy reads instead of noiseless reads (as shown in Fig 1b).However, if we utilize the information provided by the coverage, we can still confidently differentiate different repeat     Multibridging Algorithm(Alg 2) also utilizes the random flanking region but it improves upon Greedy Algorithm by using a De Bruijn graph to aid the resolution of flanked repeats.The minimum read length needed for successful reconstruction is close to ˜ crit .
X-phased Multibridging(Alg 3) further utilizes the coverage given by multiple reads to phase the polymorphism within the repeat interior of flanked approximate repeats.The minimum read length needed for successful reconstruction is close to crit , which is the noiseless lower bound even when some long repeats are flanked approximate repeats.

Model for genome
To capture the key characteristics of repeats and to guide the design of assembly algorithms, we use the following parametric probabilistic model for genome.A target genome is modeled as a random vector s of length G that has the following three key components (a pictorial representation is depicted in Figure 5).
Random background: The background of the genome is a random vector, composed of uniformly and independently picked bases from the alphabet set Σ = {A, C, G, T }.
Long flanked repeats: On top of the random background, we randomly position the longest flanked repeat and the longest flanked triple repeat.Moreover, we randomly position a flanked repeat interleaving the longest flanked repeat, forming the longest pair of flanked interleaved repeat.The corresponding length of the flanked repeats are ˜ max , ˜ tri and ˜ int respectively.It is noted that ˜ max > max( ˜ int , ˜ tri ).
Polymorphism and long exact repeats: Within the repeat interior of the flanked repeats, we randomly position n max , n int and n tri edits (polymorphism) respectively.The sites of polymorphism are chosen such that the longest exact

Greedy Algorithm
Read R 2 is a successor of read R 1 if there exists length-W suffix of R 1 and length-W prefix of R 2 such that they are extracted from the same locus on the genome.Furthermore, there is no other reads that can satisfy the same condition with a larger W .To properly determine successors of reads in the presence of long repeats and noise, we need to define an appropriate overlap rule for reads.In this section, we show the conceptual development towards defining such a rule, which is called RA-rule.
Noiseless reads and long exact repeats: If the reads are noiseless, all reads can be paired up with their successors correctly with high probability when the read length exceeds max .It was done [1] by greedily pairing reads and their candidate successors based on their overlap score in descending order.When a read and a candidate successor are paired, they will be removed from the pool for pairing.
Here the overlap score between a read and a candidate successor is the maximum length such that the suffix of the read and prefix of the candidate successor match exactly.
Noisy reads and random background: Since we cannot expect exact match for noisy reads, we need a different way to define the overlap score.Let us consider the following toy situation.Assume that we have exactly one length-( + 1) noisy read starting at each locus of a length G random genome(i.e.only consists of the random background).Each read then overlaps with its successor precisely by bases.Analogous to the noiseless case, one would expect to pair reads greedily based on overlap score.Here the overlap score between a read and a candidate successor is the maximum length such that the suffix (x) of the read and prefix (y) of the candidate successor match approximately.To determine whether they match approximately, one can use a predefined a threshold factor α and compute the Hamming distance d(x, y).If d(x, y) ≤ α • , then they match approximately, otherwise not.Given this decision rule, we can have false positive (i.e.having any pairs of reads mistakenly paired up) and false negative (i.e.having any reads not paired up with the true successors).If false positive and false negative probability are small, this naive method is a reliable enough metric.This can be achieved by using a long enough length > iid and an appropriately chosen threshold α.
Recall that is the overall failure probability.By bounding the sum of false positive and false negative probability by /3, one can find iid (p, /3, G) and α(p, /3, G) to Figure 6: Intuition about why we define the overlap rule to be RA-overlap rule be the ( iid , α) solution to the following pair of equations: where Noisy reads and long flanked repeats: However, when the genome contains long flanked repeats on top of the random background, this naive rule of determining overlap is not enough.Let us look at the example in Fig. 6.As shown in Fig. 6, because of long flanked repeats, we have a small ratio of overall distance against the overlap length for the segments that are extracted from different copies of the repeat (e.g Segment 1 and Segment 3 in Fig. 6).Therefore, the overall Hamming distance between two segments is not a good enough metric for defining overlap.If we abide by the naive rule, we need to increase the read length significantly longer than the flanked repeat length so as to guarantee confidence in deciding approximate match.Otherwise, it will either result in a high false positive rate (if we set a large α) or a high false negative rate (if we set a small α).To properly handle such scenario, we define a repeat-aware rule(or RA-rule).
• RA-matching: Two segments (x, y) of length W match under the RA-rule if and only if the distance between whole segments is < α • W and both of its ending segments(of length iid ) also have distance < α • iid .• RA-overlap: The overlap score between a read and a candidate successor under the RA-rule is the maximum length such that the suffix of the read and prefix of the candidate successor match under the RA-matching.The RA-rule is particularly useful because it puts an emphasis on both ends of the overlap region.Since the ends are separated by a long range, one end will hopefully originate from the random flanking region of the flanked repeat.If we focus on the segments originating from the random flanking region, the distance per segment length ratio will be very high when the segments originate from different copies of the repeat but very low when they originate from the same copy of the repeat.This is how we utilize the random flanking region to differentiate between repeat copies and determine correct successors in the presence of long flanked repeats and noise.
If we use Greedy Algorithm (Alg 1) to merge reads greedily with this overlap rule (RA-rule), Prop 1 shows the information requirement under the previously described sequencing model and genome model.A plot is shown in Fig 1b .Since iid is of order of tens whereas ˜ max is of order of thousands, the read length requirement for Greedy Algorithm to succeed is dominated by ˜ max .The detailed proof of Prop 1 is given in Appendix.

Algorithm 1 Greedy Algorithm
Initialize contigs to be reads for W = L to iid do if any two contigs x,y are of overlap W under RA-rule then merge x, y into one contig.end end

Multibridging Algorithm
The read length requirement of Greedy Algorithm has a bottleneck around ˜ max because it requires at least one copy of each flanked repeat to be spanned by at least one read for successful reconstruction.Spanning a repeat by a single read is called bridging in [1].A natural question is whether we need to have all repeats bridged for successful reconstruction.
In the noiseless setting, [1] shows that this condition can be relaxed.Using noiseless reads, one can have successful reconstruction given all copies of each exact triple repeat being bridged, and at least one copy of one of the repeats in each pair of exact interleaved repeats being bridged.
A key idea to allow such a relaxation in [1] is to use a De Bruijn graph to capture the structure of the genome.
When the reads are noisy, we can utilize the random flanking region to specify a De Bruijn graph with high confidence by RA-rule and arrive at a similar relaxation.By some graph operations to handle the residual errors, we can have successful reconstruction with read length ˜ crit + 2 • iid < L < ˜ max .The algorithm is summarized in Alg 2. Prop 2 shows its information requirement under the previously described sequencing model and genome model.A plot is shown in Fig 1b .We note that Alg 2 can be seen as a noisy reads generalization of Multibridging Algorithm for noiseless reads in [1].

Description and its performance
Algorithm 2 Multibridging Algorithm Detailed proof is given in the Appendix.The following sketch highlights the motivation behind the key steps of Multibridging Algorithm.
[Step1] We set a large K value to make sure the K-mers overlapping the shorter repeat of the longest pair of flanked interleaved repeats and the longest flanked triple repeat can be separated as distinct clusters.
[Step2] Clustering is done using the RA-rule because of the existence of long flanked repeats and noise.
[Step3] A K-mer cluster corresponds to an equivalence class for K-mers matched under the RA-rule.This step forms a De Bruijn graph with K-mer clusters as nodes.
[Step4] Because of large K, the graph can be disconnected due to insufficient coverage.In order to reduce the coverage constraint, we connect the clusters greedily.
[Step6] Branch clearing repairs any incorrect merges near the boundary of long flanked repeat.
[Step8] Since an Euler path in the condensed graph corresponds to the correct genome sequence, it is traversed to form the reconstructed genome.

Some implementation details: improvement on time and space efficiency
For Multibridging Algorithm, the most computational expensive step is the clustering of K-mers.To improve the time and space efficiency, this clustering step can be approximated by performing pairwise comparison of reads.
Based on the alignment of the reads, we can cluster K-mers from different reads together using a disjoint set data structure that supports union and find operations.Since only reads are used in the alignment, only the K-mer indices along with their associated read indices and offsets need to be stored in memory-not all the Kmers.
Pairwise comparison of reads roughly runs in Θ(N 2 L 2 ) if done in the naive way.To speed up the pairwise comparison of noisy reads, one can utilize the fact that the read length is long.We can extract all consecutive f -mers (which act as fingerprints) of the reads and do a lexicographical sort to find candidate neighboring reads and associated offsets for comparison.Since the reads are long, if two reads overlap, there should exist some perfectly matched f -mers which can be identified after the lexicographical sort.This allows an optimized version of Multibridging Algorithm to run in Θ(N L • N L G ) time and Θ(N Lf ) space.

X-phased Multibridging
As shown in Fig 1b, when long repeats are flanked approximate repeats, there can be a big gap between the noiseless lower bound and the information requirement for Multibridging Algorithm.A natural question is whether this is due to a fundamental lack of information from the reads or whether Multibridging Algorithm does not utilize all the available information.In this section, we demonstrate that there is an important source of information provided by coverage which is not utilized by Multibridging Algorithm.In particular, we introduce X-phased Multibridging, an assembly algorithm that utilizes the information provided by coverage to phase the polymorphism in long flanked repeat interior.The information requirement of Xphased Multibridging is close to the noiseless lower bound (as shown in Fig 1b) even when some long repeats are flanked approximate repeats.

Description of X-phased Multibridging
Multibridging Algorithm utilizes the random flanking region to differentiate between repeat copies.However, for a flanked approximate repeat, its enclosed exact repeat does not terminate with the random flanking region but only terminates with sparse polymorphism.When we consider the overlap of two reads originating from different copies of a flanked approximate repeat, the distinguishing polymorphism is so sparse that it cannot be used to confidently differentiate between repeat copies.Therefore, there is a need to use the extra redundancy introduced by the coverage from multiple reads to confidently differentiate between repeat copies and that is what X-phased Multibridging utilizes.
X-phased Multibridging (Alg 3) follows the algorithmic design of Multibridging Algorithm.However, it adds an extra phasing procedure to differentiate between repeat copies of long flanked repeats that Multibridging Algorithm cannot confidently differentiate.We recall that after running step 7 of Multibridging Algorithm, a node in the graph G string corresponds to a substring of the genome and has node label  To be concrete, we focus the discussion on a pair of flanked interleaved repeats, assuming triple repeats are not the bottleneck.However, the ideas presented can be generalized to repeats of more copies.
For the flanked approximate repeat with length int < L and ˜ int > L (as shown in Fig 7), there is no node-disjoint paths joining incoming/outgoing random flanking region with the distinct repeat copies in G string .It is because the reads are corrupted by noise and the polymorphism is too sparse to differentiate between the repeat copies.Executing Multibridging Algorithm directly will result in the formation of an X-node, which is an artifact due to K-mers from different copies of the flanked approximate repeat erroneously clustered together.
Successful reconstruction requires an algorithm to pair up the correct incoming/outgoing nodes of the X-node(i.e. decide how W, W and Y, Y are linked in Fig 7).This is handled by the phasing procedure in X-phased Multibridging, which uses all the reads information.The phasing procedure is composed of two main steps: • Consensus step: Confidently find out where the sites of polymorphism are located within the flanked repeat interior.• Read extension step: Confidently determine how to extend reads using the random flanking region and sites of polymorphism as anchors.Consensus step For the X-node of interest, let D be the set of reads originating from any sites of the associated flanked repeat region and let x 1 and x 2 denote the associated repeat copies.Since the random flanking region is used as anchor, it is treated as the starting base (i.e.x 1 (0) = W and x 2 (0) = W ). For the i th subsequent site of the flanked repeat (where 1 ≤ i ≤ ˜ int ), we determine the consensus according to Eq (3).This can be implemented by counting the frequency of occurrence of each alphabet overlapping at each site of the repeat.The consensus result determines the sites of polymorphism and the most likely pairs of bases at the sites of polymorphism.
Read extension step After knowing the sites of polymorphism, we use those reads that span the sites of polymorphism or random flanking region to help decide how to extend reads across the flanked repeat.Let σ be the possible configuration of alphabets at the sites of polymorphism and random flanking region (e.g.σ = (ACY, GT Y ) means that the two copies of the flanked repeat with the corresponding random flanking region respectively are W-A-C-Y, W'-G-T-Y' where the common bases are omitted).
The following maximum a posteriori estimation is used to decide the correct configuration.
where σ is the estimator, D is the raw read set, and x 1 , x 2 are the estimates from the consensus step.It is noted that the size of the feasible set for σ is 2 nint+1 .In practice, for computational efficiency, the maximization in Eq (4) can be approximated accurately even if it is replaced by the simple counting illustrated in Fig 7, which we call count-to-extend algorithm(countAlg).CountAlg uses the raw reads to establish majority vote on how one should extend to the next sites of polymorphism using only the reads that span the sites of polymorphism.

Performance
After introducing the phasing procedure in X-phased Multibridging, we proceed to find its information requirement for successful reconstruction.
The information requirement for X-phased Multibridging is the amount of information required to reduce the error of the phasing procedure to a negligible level.The phasing procedure -step 2 in Alg. 3 -is a combination of consensus and read extension steps, which contribute to the error as follows.

Algorithm 3 X-phased Multibridging
Let E be the error event of the repeat phasing procedure for a repeat, 1 be the error probability for the consensus step, 2 be the error probability for the read extension step given k reads spanning each consecutive site of polymorphism within the flanked repeat, δ cov be the probability for having k reads spanning each consecutive sites of polymorphism(i.e.k bridging reads) within the flanked repeat.We have, Therefore, to guarantee confidence in the phasing procedure, it suffices to upper bound 1 , 2 and δ cov .We tabulate the error probabilities of 1 , 2 in Table 1 for phasing a flanked repeat (whose length is 5000 whereas the genome length is 5M).
The flanked repeat has two sites of polymorphism which partition it into three equally spaced segments.
From Table 1, when p = 0.01, the information requirement translates to the condition of having three bridging reads spanning the shorter exact repeat of the longest pair of exact interleaved repeats.Therefore, the information requirement for X-phased Multibridging shown in Fig 1b also corresponds to this condition.It is noted that X-phased Multibridging has the same vertical asymptote as the noiseless lower bound.The vertical shift is due to the increase of requirement on the number of bridging reads from k = 1 (noiseless case) to k = 3 (noisy case).

Simulation of the prototype assembler
Based on the algorithmic design presented, we implement a prototype assembler for automatic genome finishing using reads corrupted by substitution noise.First, the assembler was tested on synthetic genomes, which were generated according to the genome model described previously.This demonstrates a proof-of-concept that one can achieve genome finishing with read length close to crit , as shown in Fig 8.
The number on the line represents the number of simulation rounds (out of 100) in which the reconstructed genome is a single contig with ≥ 99% of its content matching the ground truth.
Second, the assembler was tested using synthetic reads, sampled from genome ground truth downloaded from NCBI.The assembly results are shown in Table 2.The observation from the simulation result is that we can assemble genomes to finishing quality with information requirement near the noiseless lower bound.More information about the detail design of the prototype assembler is presented in the Appendix and source code/data set can be found in [7].Table 2: Simulation results on the assembly of several real genomes using reads corrupted by substitution noise ((a) Prochlorococcus marinus (b) Helicobacter pylori (c) Methanococcus maripaludis (d) Mycoplasma agalactiae) with crit = max( int , tri ) , ˜ crit = max( ˜ int , ˜ tri ) and N noiseless is the lower bound on number of reads in the noiseless case for 1 − = 95% confidence recovery

Extension to handle indel noise
A further extension of the prototype assembler addresses the case of reads corrupted by indel noise.Similar to the case of substitution noise, tests were performed on synthetic reads sampled from real genomes and synthetic genomes.Simulation results are summarized in Table 3 where p i , p d are insertion probability and deletion probability and rate is the number of successful reconstruction(i.e.simulation rounds that show mismatch < 5%) divided by total number of simulation rounds.The simulation result for indel noise corrupted reads shows that X-phased Multibridging can be generalized to assemble indel noise corrupted reads.The information requirement for automated finishing is about a factor of two from the noiseless lower bound for both N and L. We remark that one non-trivial generalization is the way that we form the noisy De Bruijn graph for K-mer clusters.In particular, we first compute the pairwise overlap alignment among reads, then we use the overlap alignment to group Kmers into clusters.Subsequently, we link successive cluster of K-mers together as

Conclusion
In this work, we show that even when there is noise in the reads, one can successfully reconstruct with information requirements close to the noiseless fundamental limit.A new assembly algorithm, X-phased Multibridging, is designed based on a  The main conclusion of this work is that, with an appropriately designed assembly algorithm, the information requirement for genome assembly is insensitive to moderate read noise.We believe that the information theoretic insight is useful to guide the design of future assemblers.We hope that these insights allow future assemblers to better leverage the high throughput sequencing read data to provide higher quality assembly.
3. With the choice of iid = iid (p, 3 , G), we have, )) )) Here we use the fact that there are indeed 4 types of overlap as in Fig 10 .And given the bridging condition, we are only left with 2 types, namely, both ending segments outside/exactly one ending segment outside the longest repeat repeat region.

Simple De Bruijn Algorithm
Before continuing proving the performance of Multibridging Algorithm, it is instructive to analyze the following Simple De Bruijn Algorithm (Alg 4) because this is closely related to the Multibridging Algorithm.Here we first define several genomic region of interest which we will refer to in the proofs below.a) S 0 = set of K-mers that are completely inside iid − neighborhood of the longest repeat b) S 1 = set of K-mers that are completely inside the longest repeat c) S 2 = S 0 \S 1 Lemma 5 Here we provide several deterministic conditions that guarantee the success of the algorithm.
1. Successive reads overlap with length at least K 2 K-mers are almost correctly clustered, that is, a) K-mers from the same locus but not merged b) x not in S 0 s.t.x get clustered with wrong K-mers c) x in S 0 s.t.x get clustered with elements other than its own cluster/mirror cluster(mirror cluster is defined to be the cluster for the other copy of the repeat) 3) Repeat at both circle are at least 2 • iid separated(the interleaving segments between the repeat differ with at least 2 iid in length) Proof We note that every length K segments x ∈ S 0 , they are represented as a distinct node in the K-mer graph because of the length K that we pick and the condition that successive reads overlap at least K bases.Moreover, for K-mers x ∈ S 1 , they are condensed into the repeat as 'X' in L−max(lint,l triple )−2 iid , with iid = iid (p, 3 , G) α = α(p, 3 , G), then, P(S C ) ≤ Proof We first note that in order to obtain a bound on the error probability, we only need to separately bound the probability that each of the conditions in Lemma 5 fail,which are ≤ 3 each.Thus, combining, we get, P(S C ) ≤ .

Multibridging Algorithm
An illustration of noisy Multibridging Algorithm is shown in Fig (12).Lemma 7 Here are the deterministic conditions for the algorithm to succeed.
1) Every successive reads overlap at least iid (p, 3 , G) 2) K-mers are almost correctly clustered, that is, a) K-mers from the same locus but not merged b) x not in S 0 s.t.x get clustered with wrong K-mers c) x in S 0 s.t.x get clustered with elements other than its own cluster/mirror cluster  Proof Along the same lines as the proof in Lemma (5), we only note that in this algorithm, we have an extra step of finding predecessor/successors.Moreover, the overlap here is significantly reduced to only iid instead of K in the Noisy Simple De Bruijn case.
Proof Here we note that with the given coverage, bridging conditions of the interleave repeat and the triple repeat are satisfied.And when this is true, then Condition 4 in Lemma 7 is true with high probability.Following the arguments in Theorem 6, we get desired.
Appendix: Design and additional algorithmic components for the prototype assembler

Pipeline of the prototype assembler
The pipeline of the prototype assembler is shown in Fig 13.With a ground truth genome as input, the output is the performance of the whole pipeline by giving the mismatch rate.Regarding that, we need to have a more robust branch clearing step.In particular, we first classfy nodes as "big" or "small" nodes based on the size of the nodes in the sequence graph.The key idea is to merge the small nodes together while keeping the big nodes unchanged.Starting from each big nodes, we tranverse the graph to detect all the small nodes that link the current big node to other big nodes.Then, we classify the small nodes into levels(depending on its distance from the current big node).After that, the small nodes in the same level are merged.Finally, we note that we keep the reachability among each big nodes.

Enhanced Multibridging Algorithm that can resolve middle range repeats
We note that the ideas presented here can also be found in the prior work on the treatment of noiseles sreads.It is stated here for completeness.In the noisy setting, instead of considering the alphabet set to be Σ = {A, C, G, T }, one can consider the alphabet set as the cluster index of the K-mers.

Algorithm 5 Enhanced Multibridging Algorithm
Resolution of repeats: 0. Intially the weight of the edge are set to be 1. 1.While there is a X-node v : a) For each edge (p i ,v) with weight ap i ,v , create a new node u i = p i → v and an edge (p i , u i ) with weight 1 + ap i v .Similarly, for each edge (v,q j ), create a new node w j = v →q j and an edge (w j , q j ) b) If v has a self-loop (v, v) with weight av,v, add an edge (v →v , v→ v) with weight av,v + 2 c) Remove node v and all incident edges d) For each pair of u i , w j adjacent in a read(extending to at least length of iid on both sides of the X-node), add edge (u i , w j ).If exactly on each of the u i and w j nodes have no added edge, add the edge.e) Condense the graph

Formation of K-mer De Bruijn graph for indel corrupted reads
In order to form K-mer De Bruijn graph for indel corrupted reads, we first need to have a clear notion of K-mers.We define K-mers to be the length K segments in the genome ground truth (as opposed to the usual definition from the reads).Although we mostly work on the reads themselves, the definition of the Kmers are based on the ground truth.In order to successfully cluster K-mers, we need to do the following steps.
1. We first compute the pairwise alignement of the reads.
2. Based on the pairwise alignment, for each length K-segments, we know which should be aligned to which.We then group them together using the alignment result.
3. Finally, we end up with the length K segments from the reads clustering together, and now we use it as an operational way to identify the Kmers since each cluster will naturally correspond to a K-mers originated from the genome groundtruth(though there are a few discrepancy, mostly this is correct).
4. After we identify the K-mers clusters, we add an edge between them if there exists a read such that there are two consecutive Kmers originate from it.

Graph surgery to clear abnormality of the noisy De Bruijn graph
Due to indel noise and runs of the same alphabet, the way that we form K-mers graph may need to abnormality of the graph.We thus perform a graph tranversal and identify the abnormality that are of short length(i.e.resulted from noise but not the genome structure).After that, we remove such abnormality.This step also involves transitive edge reduction and removal of small self loops.

Generalization to handle Indel Error
When dealing with indel noise, the neighborhood of reads can also affect consensus of the base.There we have to do sequence alignment in order to find the appropriate posterior probability in order to do a maximum likelihood estimate of whether a particular given genomic location is a site of polymorphism or not.In order to do that, we formulate the problem as a ML problem as follows. max max Here we also discuss about the places that we take approximation to enhance the computational efficiency in the steps of the previous reduction.From (6) to (7), we use some heuristics to find out the possible location of SNPs within the whole repeat in which disagreement is observed after several rounds of error correction.From (7) to (8), we remove all the reads that only span one single SNPs and it has no effect on the error of the detection problem that we are trying to solve.From (8) to (9), we further partition the reads into group in which S jk is the set of reads that only span the SNPs j to j+k.Doing this can decompose the ML problem into smaller subproblems with no effect on the accuracy.Finally, in practice, we take a first order approximation of ( 9) to (10) by only onsidering two SNPs for each subproblem.
As for each of the marginal probability distribution, the best way is to run Sum-Product algorithm to compute in a dynamic programming fashion similar to S-W alignment.But as pointed out in Quiver, this steps can be significanly speeded up using a Viterbi approximation and this is also what we implemented in the simulation code.

Simulation study
We simulated on both synthetic and real data set with indel noise and on a double stranded DNA.In the simulation, we assume that the reads from the neighborhood of a repeat is given and our goal is to decide how to extend the reads to span the repeat copies into the flanking region correctly.The correctness is evaluated based on whether they can correctly extend the correct reads into the flanking region.

Edit distance metric calibration
We also do a study on whether we can use alignment score to differentiate between segments from being extracted from the same locus or not.In Fig 14, the upper curve is the score for segment extracted from the same locus while the bottom curve is for completely iid randomly(irrelevant) generated segment.And we simulate it for 100 times at each length and the bar indicate 1 standard deviation from the mean.

Tolerance in the Multibridging step
We note that due to the indel noise and the graph surgery that we perform, an X-node of the graph may be p times longer than the usual size of the approximate repeat, thus we should have a corresponding higher tolerance to use the reads to bridge across the repeats.

Computation speed up of alignment step
The key bottleneck in computation speed of the indel extension is on the pairwise alignment of the reads, which can be speeded up using the ideas in BLAST.We use sorting to identify exact matching fingerprint that identify the starting and ending location of the segment that need to be aligned with.After that, we do a local search instead of the whole dynamic programming search.Classification of approximate repeat While repeats are studied in the literature[?],they are not investigated by looking at the ground truth.This is partially due to the insufficiency of data in the early days of genome assembly development.Therefore, based on the ground truth genome, we define several quantities that allow us to classify approximate repeat and understand the approximate repeat spectrum of genome.Here we define stretch and mutation rate.Stretch is defined to be the ratio of the length (l * ) of the longest exact repeat within an approximate repeat divided by the length (l approx ) of the approximate repeat.Mutation rate is defined to be number of mutation within approximate repeat divided by (l approx −l * ).An illustration is shown in Fig 15.
Moreover, we do a scatter plot to classify the approximate repeats(approximate repeat having exact repeat length within top 20) and we have a plot of approximate repeat spectrum as in Fig 16.
From the plots in Fig 16, we classify approximate repeat as homologous repeat if the stretch is bigger than 1.25 and as non-homologous repeat if the stretch is less than 1.25.
For the scatter plot, every approximate repeat is a dot there with x coordinate and y coordinate being mutation rate and stretch respectively.And the color represents the length of that approximate repeat.For the approximate repeat spectrum We focus on genomes when the non-homologous repeat dominates, namely the longest interleave and the longest triple repeats are non-homologous because the stretch is relatively short which can be captured by our generative model.We do not distinguish between the length of approximate or exact repeat are considered to be the same and we do not distinguish between the two in the discussion because of the small stretch.

Parametric model
Let L k be the number of bases between the (k − 1) th and the k th SNPs starting from the right end-point of a repeat.
We consider the following probabilistic model for the L k .{L k } n k=1 is taken as an independent sequence of geometrically distributed random variables with parameter Θ = {p 1 , p 2 , r} defined as follows.
(a) Information requirement for noiseless reads (b) Information requirement for noisy reads

Figure 2 :
Figure 2: Simulation results on a prototype assembler (substitution noise of rate 1.5 %)

From
information theoretic insight to algorithm design Because of the structure of long flanked repeats, there are two important sources of information that we specifically want to utilize when designing data-efficient algorithms to assemble noisy reads.They are • The random flanking region beyond the repeat interior • The coverage given by multiple reads overlapping at the same site (a) Noiseless reads spanning an exact repeat and its terminating bases (b) Noisy reads spanning a flanked exact repeat and its terminating random flanking region (c) Noisy reads extending to span a flanked approximate repeat and its terminating random flanking region

Figure 4 :
Figure 4: Intuition behind the information requirement

Figure 5 :
Figure 5: Model for genome

1 . 7 .
Choose K to be ˜ crit + 2 iid and extract K-mers from reads.2. Cluster K-mers based on the RA-rule.3. Form uncondensed De Bruijn graph G De−Bruijn = (V, E) with the following rule: • a) K-mers clusters as node set V .• b) (u, v) = e ∈ E if and only if there exists K-mers u 1 ∈ u and v 1 ∈ v such that u 1 ,v 1 are consecutive K-mers in some reads.4. Join the disconnected components of G De−Bruijn together by the following rule: for W = K − 1 to iid do for each node u which has either no predecessors / successors in G De−Bruijn do a) Find the predecessor/successor v for u from all possible K-mers clusters such that overlap length(using any representative K-mers in that cluster) between u and v is W under RA-rule.b) Add dummy nodes in the De Bruijn graph to link u with v and update the graph to G De−Bruijn end end 5. Condense the graph G De−Bruijn to form G string with the following rule:• a) Initialize G string to be G De−Bruijn with node labels of each node being its cluster group index.• b) while ∃successive nodes u → v such that out − degree(u) = 1and in − degree(v) = 1 do bi) Merge u and v to form a new node w bii) Update the node label of w to be the concatenation of node labels of u and v end 6.Clear Branches of G string : for each node u in the condensed graph G string do if out − degree(u) > 1 and that all the successive paths are of the same length(measured by the number of node labels) and then joining back to node v and the path length < iid then we merge the paths into a single path from u to v. end end Condense graph G string 8. Find the genome :• a) Find an Euler Cycle/Path in G string and output the concatenation of the node labels to form a string s labels .• b) Using s labels and look up the associated K-mers to form the final recovered genome ŝ.

Figure 7 :
Figure 7: Illustration of how to phase polymorphism to extend reads across repeats

Figure 8 :
Figure 8: Simulation results on the assembly of synthetic genomes using reads corrupted by substitution noise.The parameters are as follows.G = 10K; ˜ max = max = 500, ˜ int = 200, int = 100 with two sites of polymorphism within the flanked repeat.p = 1.5%, = 5%.

Figure 9 :
Figure 9: Treatment of reads corrupted by indel noise

Figure 10 :
Figure 10: Overlap Type Fig 11a.However, for the K-mers x ∈ S 2 , they have chances not to merge properly, thus they form into the branches surrounding 'X' in Fig 11a.Because of condition 3, branch clearing will not eliminate the 'A' or 'C' in Fig 11a, further after condensing, we get the desired K-mer graph as in Fig 11b and this can be successfully read by a Eulerian Walk.Theorem 6 If G > 6 iid , G ≥ N ≥ G•ln 3N

Figure 12 :Figure 13 :
Figure 12: An illustration of the Noisy Multi Bridging

Figure 14 :
Figure 14: A calibration for similarity score using global alignment computation.

Figure 15 :
Figure 15: Example of how to define stretch and mutation rate

Figure 17 :
Figure 17: An example plot that define the stopping point of approximate repeat by Algorithm 6

Figure 18 :
Figure 18: Dot plot of recovered genomes against ground truth(according to index in Table 2) Figure 18: Dot plot of recovered genomes against ground truth(according to index in Table 2)

Table 1 :
1. Perform Step 1 to Step 7 of MultiBridging Algorithm 2. For every X-node x ∈ G string • a) Align all the relevant reads to the flanked repeat x • b) Consensus step: Consensus to find location of polymorphism by solving Eq (3) Calibration of error probability made by the phasing procedure of X-phased Multibridging

Table 3 :
Simulation results on the assembly of real/synthetic genomes using reads corrupted by indel noise(Synthetic: randomly generated to fit ˜ max , ˜ crit , crit ; (a) : Prochlorococcus marinus ; (b): Helicobacter pylori ) probabilistic model of the genome.It is shown through analysis to perform well on the model, and through simulations to perform well on real genomes.
Algorithm 4 Noisy Simple De Bruijn 0. Choose K to be max(l interleave , l triple )

Table 4 :
Simulation results on long contig creator(C s , L s are coverage and readlength for short reads.C l , L l are coverage and readlength for long reads.p del , p ins are the probability of insertion and deletion.G is the length of the genome.Homology is the number of SNPs divided by the length of the approximate repeat.l approx , l exact are the length of the approximate and exact repeat being studied. i∈S P (R i | T ), P err = P opt + δ 1