### Breaker: breaking inaccurate genome

After the functional decomposition of BIGMAC in the previous section, we are ready to investigate its first component: Breaker. The goal of Breaker is to fix mis-assemblies. To achieve that, we need to collect sensible signals related to mis-assemblies and subsequently aggregate those signals to make contig-breaking decisions.

#### Signal Detector

Signal Detector collects three important signals related to mis-assemblies.

**Palindrome** We are interested in palindromes because of their relationship to a form of chimeric reads, the adaptor-skipped reads, which are common in today’s long read technology [8]. Since assemblers get stuck at these chimeric reads, the palindrome pattern in reads propagates to the corresponding contigs. Thus, the palindrome pattern is a strong signal indicating mis-assemblies, particularly when the palindrome is long. A string tuple (*a*,*b*) is called a wrapping pair if the reverse complement of *a* is a prefix of *b* or the reverse complement of *b* is a suffix of *a*. *x* is called a palindrome if it is the concatenation of a wrapping pair (*a*,*b*), that is *x*=*ab*. The wrapping length of *x* is \(\max_{x=ab,\;(a,b)\text{ is a wrapping pair}} \min(a.length, b.length)\). For example, *x*=*ACGGCCG* is a palindrome of wrapping length 3; (*a*,*b*)=(*ACGG*,*CCG*) is a wrapping pair because the reverse complement of *b* is *CGG*, which is a suffix of *a*. Since the minimum length of *a* and *b* is min(4,3)=3 and the wrapping length of *x* cannot exceed 3, the wrapping length of *x* is 3.
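The definition above can be checked mechanically. The following minimal sketch (our own illustration, not BIGMAC code; `revcomp` and `wrapping_length` are names we chose) computes the wrapping length of a DNA string by trying every split *x*=*ab*:

```python
def revcomp(s):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[c] for c in reversed(s))

def wrapping_length(x):
    """Max of min(len(a), len(b)) over all splits x = a + b
    where (a, b) forms a wrapping pair."""
    best = 0
    for i in range(1, len(x)):
        a, b = x[:i], x[i:]
        # (a, b) is a wrapping pair if revcomp(a) is a prefix of b,
        # or revcomp(b) is a suffix of a
        if b.startswith(revcomp(a)) or a.endswith(revcomp(b)):
            best = max(best, min(len(a), len(b)))
    return best
```

For the example in the text, `wrapping_length("ACGGCCG")` returns 3, attained by the wrapping pair (`ACGG`, `CCG`).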

Signal Detector collects information about palindromes by aligning each contig against itself. It then identifies palindromes with long wrapping length because that indicates mis-assemblies. The corresponding palindromes’ information is then put into *S*_{palindrome}. To improve sensitivity, Signal Detector allows approximate matches when searching for palindromes. We note that approximate matches are determined by computing the edit distance of the corresponding strings. To determine approximate matches in BIGMAC, we use nucmer in MUMmer [7] with default parameters and with the option --maxmatch. Two strings are considered approximately matched when nucmer reports so.

**Repeat** Since long repeats confuse assemblers, their endpoints are possible candidates for mis-assemblies. Let *s*_{1}=*axb* and *s*_{2}=*cxd*; a repeat between *s*_{1},*s*_{2} is specified by the endpoints of *x* in *s*_{1},*s*_{2}. For example, if *s*_{1}=*CAAAAT* and *s*_{2}=*GAAAAG*, the endpoints of the repeat *AAAA* are the positions marked by ! in *C*!*AAAA*!*T* and *G*!*AAAA*!*G*. Signal Detector aligns contigs against themselves to find the repeats. It then marks down the positions of the endpoints and puts them in a set called *S*_{repeat}. To improve sensitivity, Signal Detector allows approximate matches when searching for repeats; approximate matches are determined by computing the edit distance of the corresponding strings. Moreover, it only considers repeats that are maximal and of significant length.
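The notion of repeat endpoints can be illustrated with a small sketch (ours, not BIGMAC’s; BIGMAC uses alignment tools rather than this quadratic dynamic program). It finds the longest common substring of two strings and reports its endpoint positions in each:

```python
def repeat_endpoints(s1, s2):
    """Longest common substring of s1, s2 and its (start, end)
    endpoint positions in each string (0-based, half-open)."""
    best = (0, 0, 0)  # (length, end index in s1, end index in s2)
    # dp[i][j] = length of the longest common suffix of s1[:i], s2[:j]
    dp = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best[0]:
                    best = (dp[i][j], i, j)
    n, i, j = best
    return s1[i - n:i], (i - n, i), (j - n, j)
```

For the text’s example, `repeat_endpoints("CAAAAT", "GAAAAG")` returns `("AAAA", (1, 5), (1, 5))`: the repeat and the positions of the ! markers in both strings.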

**Coverage** Significant coverage variation along contigs can correspond to false joins of sequences from different genomes with different abundances. The coverage profile is the coverage depth along a contig. For example, the coverage profile along the string *s*=*ACGT* is (1,2,2,1) if the reads are *AC*, *CG*, *GT*. Signal Detector aligns the original reads to the contigs to find the coverage profile, which is called *S*_{coverage}.
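A coverage profile of this kind can be computed from read alignment intervals; a minimal sketch (our own convention of 0-based, half-open intervals):

```python
def coverage_profile(contig_len, alignments):
    """Coverage depth at each contig position, given read alignments
    as (start, end) half-open intervals on the contig."""
    profile = [0] * contig_len
    for start, end in alignments:
        for pos in range(start, end):
            profile[pos] += 1
    return profile
```

For *s*=*ACGT* with reads *AC*, *CG*, *GT* aligned at (0,2), (1,3), (2,4), this gives [1, 2, 2, 1], matching the example above.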

#### Signal Aggregator

After Signal Detector collects the signals regarding palindromes *S*_{palindrome}, repeats *S*_{repeat} and coverage profile *S*_{coverage}, Signal Aggregator uses them to determine the breakpoints on the input contigs *C*. The algorithm is summarized in Algorithm 1.

**ChimericContigFixing** The goal of ChimericContigFixing is to fix the contigs formed from chimeric reads. We simply break the palindromes in *S*_{palindrome} at the locations corresponding to their wrapping lengths. After removing redundant contigs, ChimericContigFixing returns the broken palindromes together with the unchanged contigs.

**LocatePotentialMisassemblies** The goal of the subroutine LocatePotentialMisassemblies is to locate potential mis-assemblies caused by repeats. We study the design of this subroutine in this section.

**Motivating question and example** We could declare all the endpoints of approximate repeats in *S*_{repeat} to be potential mis-assemblies. While this is a sensible baseline algorithm, it is not immediately clear whether it is sufficient or necessary. It is thus natural to consider the following question.

Given a set of contigs, how can we find the smallest set of locations on the contigs to break so that the broken contigs are consistent with any reasonable ground truth? To illustrate our ideas, we consider an example with contigs *x*_{1}=*abcde*, *x*_{2}=*fbcg*, *x*_{3}=*hcdi*, with {*a*,*b*,*c*,*d*,*e*,*f*,*g*,*h*,*i*} being strings of equal length *L*.

The baseline algorithm of breaking the contigs at the starting points of all long (≥2*L*) repeats breaks the contigs 4 times (i.e. *a*|*b*|*cde*, *f*|*bcg*, *h*|*cdi*). However, interestingly, we will show that one only needs to break the contigs 3 times to preserve consistency (i.e. *x*_{1}=*ab*|*cde*, *x*_{2}=*fb*|*cg*, *x*_{3}=*h*|*cdi*) and that this is optimal.

**Modelling and problem formulation** We will formalize the notions of feasible break points, feasible ground truths, consistency between sets of contigs, sufficiency of break points to achieve consistency, and the optimality criterion.

We use a graph-theoretic framework. Specifically, we study a directed graph *G*=(*V*,*E*) with *m* sources *S* and *m* sinks *T*, where ∀*v*∉*S*∪*T*, *indeg*(*v*)=*outdeg*(*v*), and parallel edges between two vertices are allowed. This models a fully contracted de Bruijn graph formed by successive K-mers of the contigs. Vertices *V* are substrings of the contigs and edges *E* correspond to potentially mis-assembled locations on the contigs. In our example, the set of vertices is *V*={*a*,*b*,*c*,*d*,*e*,*f*,*g*,*h*,*i*} and the set of edges is *E*={*ab*, *fb*, *bc*_{1}, *bc*_{2}, *hc*, *cd*_{1}, *cd*_{2}, *cg*, *de*, *di*}. We use subscripts to differentiate parallel edges joining the same vertices. The graph corresponding to our running example is shown in Fig. 3.
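For concreteness, the running-example graph can be written down as an edge list, with parallel edges kept as repeated entries (a sketch in our own notation, not BIGMAC’s data structures):

```python
from collections import defaultdict

# Edges of the running example; bc and cd each appear twice (parallel edges).
edges = [("a", "b"), ("f", "b"), ("b", "c"), ("b", "c"), ("h", "c"),
         ("c", "d"), ("c", "d"), ("c", "g"), ("d", "e"), ("d", "i")]

succ, prev = defaultdict(set), defaultdict(set)
for u, v in edges:
    succ[u].add(v)
    prev[v].add(u)

sources = sorted(v for v in succ if v not in prev)  # no incoming edges
sinks = sorted(v for v in prev if v not in succ)    # no outgoing edges
print(sources, sinks)
```

This prints `['a', 'f', 'h'] ['e', 'g', 'i']`, the *m*=3 sources and sinks; every other vertex has equal in- and out-degree, as required.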

Given such a graph *G*, we note that *E* is the set of all **feasible break points**, because each edge in the graph corresponds to a potentially mis-assembled location on contigs. A **feasible ground truth** corresponds to a set of *m* edge-disjoint source-to-sink trails that partitions the edge set *E*. For simplicity, we represent a trail as a sequence of the vertices in *G*, where the edges linking the vertices are ignored. For example, {*abcde*, *fbcdi*, *hcg*} is a feasible ground truth, and {*abcg*, *fbcde*, *hcdi*} is another feasible ground truth. The set of all feasible ground truths is *GT*.

We recall that our high-level goal is to choose a set of feasible break points *R*⊆*E* so that the broken contigs are consistent with any feasible ground truth. So, we need to define the notions of broken contigs and of consistency between two sets of contigs under the current framework. Let *gt*∈*GT*; we denote by *gt*∖*R* the set of trails after the removal of the edge set *R*. In particular, for the original contig set *C*∈*GT*, *C*∖*R* is the set of broken contigs for the set of feasible break points *R*. For example, if *R*={*bc*_{1}, *bc*_{2}, *hc*} and *C*={*abcde*, *fbcdi*, *hcg*}, then *C*∖*R*={*ab*, *cde*, *fb*, *cdi*, *h*, *cg*}. To capture consistency between two sets of contigs, we use the following definition. Given two sets of trails *T*_{1},*T*_{2}, we say that *T*_{1} is **consistent** with *T*_{2} if ∀*x*∈*T*_{1}, ∃*y*∈*T*_{2} s.t. *x*⊆*y* and ∀*x*′∈*T*_{2}, ∃*y*′∈*T*_{1} s.t. *x*′⊆*y*′. We call *R* a **sufficient breaking set** with respect to (*C*,*GT*) if ∀*gt*∈*GT*, *C*∖*R* is consistent with *gt*∖*R*. In other words, *R* is a set of feasible break points that allows the broken contigs to be consistent with any feasible ground truth. Although this notion of sufficient breaking set is a natural model of the problem, it depends on the original contig set *C*, which is in fact not necessary. Specifically, we show that we have an equivalent definition of sufficient breaking set without referring back to the original contig set. Let us call *R* a sufficient breaking set with respect to *G*, or simply a sufficient breaking set, if ∀*gt*_{1},*gt*_{2}∈*GT*, *gt*_{1}∖*R* is consistent with *gt*_{2}∖*R*.

### **Proposition 1**

*R* is a sufficient breaking set with respect to (*C*,*GT*) if and only if *R* is a sufficient breaking set with respect to *G*.

### *Proof*

The backward direction is immediate because *C*∈*GT*. We show the forward direction as follows. Let *g*_{1},*g*_{2}∈*GT*; we want to show that *g*_{1}∖*R* is consistent with *g*_{2}∖*R*. Since *R* is a sufficient breaking set with respect to (*C*,*GT*), *g*_{1}∖*R* is consistent with *C*∖*R*. Therefore, ∀*x*∈*g*_{1}∖*R*, ∃*y*∈*C*∖*R* s.t. *x*⊆*y*. But since *g*_{2}∖*R* is consistent with *C*∖*R*, we have ∃*z*∈*g*_{2}∖*R* s.t. *y*⊆*z*. By transitivity, we have *x*⊆*y*⊆*z*∈*g*_{2}∖*R*. By symmetry, we can also show that ∀*x*′∈*g*_{2}∖*R*, ∃*y*′∈*g*_{1}∖*R* s.t. *x*′⊆*y*′. Thus, *g*_{1}∖*R* is consistent with *g*_{2}∖*R*. □

Now, we state our **optimization criterion**. The goal here is to minimize the cardinality of the sufficient breaking set, formally as Eq. 1.

$$ {} \begin{aligned} OPT =& \min_{R\subseteq E} |R| \; s.t. \; R\; \text{is a sufficient breaking set with}\\[-5pt] &\,\text{respect to}\; G \end{aligned} $$

(1)

We will show that if we solve Eq. 1 for our running example, the answer is 3. This thus shows that the baseline algorithm of breaking contigs at all starting points (in our example, there are 4 of them) of all long repeats is not optimal.

### **Proposition 2**

For our running example, *OPT* = 3.

### *Proof*

First, by inspecting the 6 feasible ground truths in *GT*, we note that *R*={*bc*_{1}, *bc*_{2}, *hc*} is a sufficient breaking set with respect to *G*. Second, we note that the chosen edge set needs to disconnect the sources and sinks; otherwise, we can find *g*_{1},*g*_{2}∈*GT* such that *g*_{1}∖*R*, *g*_{2}∖*R* are inconsistent. This means |*R*| needs to be at least the mincut of the graph, which is 3. □

**Algorithm development and performance guarantee** Next we describe an algorithm that finds a sufficient breaking set with respect to *G*. Let us denote by *b*_{e} a boolean variable on each edge *e*∈*E*, with \(\vec {b} = (b_{e})_{e\in E}\). For *v*∈*V*, *InEdges*(*v*), *OutEdges*(*v*) are the sets of incoming edges and outgoing edges at *v* respectively, and *Prev*(*v*), *Succ*(*v*) are the sets of predecessor vertices and successor vertices of *v* respectively. Our algorithm solves the following minimization problem (Eq. 2) as a proxy to (Eq. 1).

$$ \min_{r \subseteq \vec{b}: r =True \Rightarrow \Phi(\vec{b}) = True} |r| $$

(2)

where,

$$ \begin{aligned} \Phi(\vec{b}) =& \wedge_{v: |Prev(v)| \ge 2~ \text{and~} |Succ(v)| \ge 2}\\ &\left(\wedge_{e\in InEdges(v)} b_{e} \vee \wedge_{e\in OutEdges(v)} b_{e} \right) \end{aligned} $$

(3)

In other words, for every vertex with at least 2 successors and at least 2 predecessors, it includes either all the incoming edges or all the outgoing edges in *R*. We then seek an *R* with minimum cardinality among the available choices. If *G* can be decomposed into connected components, we can optimize \(\Phi (\vec {b})\) independently on each connected component. In our implementation, if the size of a connected component is not too large, we optimize the objective function by exhaustion. Beyond a certain threshold, we simply output a feasible solution without optimizing. Indeed, in our experiments on real data, most of the connected components are not that large and this step typically takes a few minutes. But we remark that for more complex applications, one can further optimize the algorithm. For example, one can first topologically sort the vertices and then use dynamic programming to solve Eq. 2 along the sorted vertices.
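A minimal sketch of the exhaustive version of this optimization (our own code, not BIGMAC’s implementation): every vertex with at least two distinct predecessors and two distinct successors contributes one binary choice (put all of its in-edges or all of its out-edges into *R*), and we take the assignment of minimum total size:

```python
from collections import defaultdict
from itertools import product

def min_sufficient_breaking_set(edges):
    """Solve Eq. 2 by exhaustion. Edges are (u, v) pairs; parallel
    edges are repeated entries, identified here by their index."""
    in_e, out_e = defaultdict(list), defaultdict(list)
    for idx, (u, v) in enumerate(edges):
        out_e[u].append(idx)
        in_e[v].append(idx)
    # vertices with >= 2 distinct predecessors and >= 2 distinct successors
    branching = [v for v in set(in_e) & set(out_e)
                 if len({edges[i][0] for i in in_e[v]}) >= 2
                 and len({edges[i][1] for i in out_e[v]}) >= 2]
    best = None
    for choice in product((0, 1), repeat=len(branching)):
        picked = set()
        for v, side in zip(branching, choice):
            picked.update(in_e[v] if side == 0 else out_e[v])
        if best is None or len(picked) < len(best):
            best = picked
    return best if best is not None else set()
```

On the running example only *c* qualifies (*Prev*(*c*)={*b*,*h*}, *Succ*(*c*)={*d*,*g*}), and either choice yields |*R*|=3, matching Proposition 2.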

We note that the algorithm described gives an optimal solution for our running example. Moreover, we show performance guarantee of the algorithm as follows.

### **Proposition 3**

Solving Eq. 2 gives a sufficient breaking set *R* if the graph *G* is fully contracted.

### *Proof*

Let *R* be the set of edges selected by the algorithm. For any two *gt*_{1},*gt*_{2}∈*GT*, we want to show that *gt*_{1}∖*R* and *gt*_{2}∖*R* are consistent. By symmetry, it suffices to prove that if *x*∈*gt*_{1}∖*R*, then ∃*y*∈*gt*_{2}∖*R* s.t. *x*⊆*y*. If |*x*|=2, it is immediate because every edge other than those in *R* is covered. If |*x*|≥3, we show that it is also true by contradiction. If ∀*y*∈*gt*_{2}∖*R*, *x*⊈*y*, we can find a sub-trail *x*′=(*a*_{1},*a*_{2},…,*a*_{k},*a*_{k+1}) of *x* such that ∃*y*′∈*gt*_{2}∖*R* s.t. *x*″=(*a*_{1},…,*a*_{k})⊆*y*′ but ∀*y*∈*gt*_{2}∖*R*, *x*′⊈*y*. This implies ∃*a*^{∗}≠*a*_{k+1} s.t. (*x*″,*a*^{∗})⊆*z* for some *z*∈*gt*_{2}∖*R*. Since the edge (*a*_{k},*a*_{k+1})⊆*x*∈*gt*_{1}∖*R*, we know that (*a*_{k},*a*_{k+1}) is not in *R*. Similarly, (*a*_{k},*a*^{∗})∉*R* because (*a*_{k},*a*^{∗})⊆*y*′∈*gt*_{2}∖*R*. But since |*Succ*(*a*_{k})|≥2, the fact that our algorithm does not include (*a*_{k},*a*^{∗}), (*a*_{k},*a*_{k+1}) in *R* implies that |*Prev*(*a*_{k})|=1, namely *Prev*(*a*_{k})={*a*_{k−1}}. Note that we are considering a fully contracted graph, so the fact that *a*_{k−1} exists implies that |*Succ*(*a*_{k−1})|≥2. But our algorithm has not removed the edge (*a*_{k−1},*a*_{k}). This gives |*Prev*(*a*_{k−1})|=1. Inductively, we get |*Prev*(*a*_{i})|=1 ∀2≤*i*≤*k*. But we know that (*a*_{k},*a*_{k+1})⊆*w* for some *w*∈*gt*_{2}∖*R*. Since the edges along (*a*_{1},…,*a*_{k+1}) are not in *R*, this gives *x*′=(*a*_{1},…,*a*_{k+1})⊆*w*∈*gt*_{2}∖*R*. This contradicts the assumption about *x*′. □

### **Proposition 4**

If the graph *G* is a fully contracted DAG without parallel edges, then solving Eq. 2 returns a sufficient breaking set of optimal cardinality, *OPT*.

### *Proof*

It suffices to show that for any sufficient breaking set *R*, ∀*v*∈*V* where |*Succ*(*v*)|≥2 and |*Prev*(*v*)|≥2, we have either *InEdges*(*v*)⊆*R* or *OutEdges*(*v*)⊆*R*. We prove this by contradiction. If not, ∃*v*∈*V* where |*Succ*(*v*)|≥2, |*Prev*(*v*)|≥2 but *InEdges*(*v*)⊈*R* and *OutEdges*(*v*)⊈*R*. Because the graph is a DAG, we can find *gt*_{1}∈*GT* and *x*,*y*,*x*′,*y*′ such that (*x*,*v*,*y*)∈*gt*_{1} and (*x*′,*v*,*y*′)∈*gt*_{1}. Now consider *gt*_{2}, a clone of *gt*_{1} except that it has (*x*,*v*,*y*′), (*x*′,*v*,*y*) instead of (*x*,*v*,*y*), (*x*′,*v*,*y*′). We note that *gt*_{2}∈*GT*. And because there are no parallel edges, (*x*,*v*,*y*) is not a subset of any of the trails in *gt*_{2}. So, we find two distinct *gt*_{1},*gt*_{2}∈*GT* such that *gt*_{1}∖*R*, *gt*_{2}∖*R* are not consistent. This contradicts the fact that *R* is a sufficient breaking set. □

It is noteworthy that if we are given any set of contigs from any *gt*∈*GT*, we still obtain the same set of broken contigs after the operation of removing redundant trails, RemoveRedundant (i.e. we eliminate contigs in a set *A* to form a minimal subset *B*⊆*A* in which ∀*x*≠*y*∈*B*, *x*⊈*y*). This can be formalized as follows.

### **Proposition 5**

If *R* is a sufficient breaking set, then for any *gt*_{1},*gt*_{2}∈*GT*, *RemoveRedundant*(*gt*_{1}∖*R*) = *RemoveRedundant*(*gt*_{2}∖*R*).

### *Proof*

Let *s*_{i}=*RemoveRedundant*(*gt*_{i}∖*R*) for *i*∈{1,2}. By symmetry, it suffices to prove that *s*_{1}⊆*s*_{2}. For any *x*∈*s*_{1}⊆*gt*_{1}∖*R*, there exists *y*∈*gt*_{2}∖*R* such that *x*⊆*y*. Note that we can also find some *x*^{∗}∈*s*_{2} such that *y*⊆*x*^{∗}. This gives *x*⊆*y*⊆*x*^{∗}. However, since we have no redundant trails in *s*_{1}, we get *x*=*x*^{∗}. Thus *x*=*x*^{∗}∈*s*_{2}. □

To apply BIGMAC to real data, we have to implement the described algorithm with some further engineering. This includes methods to tolerate noise, to handle the double-stranded nature of DNA, and to construct the de Bruijn graph from the repeats. Interested readers can refer to Additional file 1: Appendix for these implementation details.

**ConfirmBreakPoints** The goal of ConfirmBreakPoints is to utilize the coverage profile *S*_{coverage} to confirm breaking decisions at the potentially mis-assembled locations specified in *S*_{filter}. For the sake of simplicity, we now consider a string *s* of length *L*, and a set of positions *Y*={*y*_{i}}_{1≤*i*≤*k*} of *s* which are potential mis-assemblies. Furthermore, let us assume that all mis-assemblies are caused by mixing genomes of different abundances. Using *Y*, we can partition *s* into segments {*s*_{i}}_{0≤*i*≤*k*} of lengths {*ℓ*_{i}}_{0≤*i*≤*k*} respectively. We let *x*_{i} be the number of reads that start in segment *s*_{i}, which can be estimated from *S*_{coverage}. The question is how to classify the points in *Y* as true mis-assemblies or as fake mis-assemblies.

One can use heuristics, like comparing the coverage depth difference before and after each *y*_{i}. However, this is not ideal. For example, if the coverage depth before and after *y*_{1} differs by 1X, we would expect it to be a much stronger signal for a true mis-assembly if the lengths *ℓ*_{0},*ℓ*_{1} are of order 100 K rather than of order 100, and this cannot be seen by considering the coverage depth difference alone. For the case of just two segments of equal length, and if we assume the *x*_{i}’s are independent Poisson random variables, there is a popular statistical test, the C-Test [9], that can make the decision. Formally, if *x*_{1}∼*Poisson*(*m*_{1}) and *x*_{2}∼*Poisson*(*m*_{2}), then the C-Test decides between the hypotheses *H*_{0}: *m*_{1}=*m*_{2} and *H*_{1}: *m*_{1}≠*m*_{2}. The C-Test first considers *x*_{1}+*x*_{2} to compute the decision boundary, and it then decides whether to reject *H*_{0} based on *x*_{1}/*x*_{2} and the previously derived decision boundary. The intuition is that *x*_{1}+*x*_{2} corresponds to the amount of data, which determines the confidence of a statistical test. Thus, if *x*_{1}+*x*_{2} is large, a small perturbation of *x*_{1}/*x*_{2} from 1 can still be very significant and can correspond to a confident rejection of *H*_{0}.
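For segments of equal length, a test of this kind can be carried out as an exact conditional binomial test: under *H*_{0}, conditioned on *n*=*x*_{1}+*x*_{2}, *x*_{1} is Binomial(*n*, 1/2). The sketch below is our own rendering of that conditional form, not BIGMAC’s code; see [9] for the original test:

```python
from math import comb

def c_test_reject(x1, x2, alpha=0.05):
    """Decide H0: m1 = m2 vs H1: m1 != m2 for x1 ~ Poisson(m1),
    x2 ~ Poisson(m2), assuming equal segment lengths. Under H0,
    conditioned on n = x1 + x2, x1 ~ Binomial(n, 1/2)."""
    n = x1 + x2
    if n == 0:
        return False  # no reads: cannot reject
    # exact two-sided p-value: total probability of all outcomes
    # no more likely than the observed one
    obs = comb(n, x1)
    p_value = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= obs) / 2 ** n
    return p_value < alpha
```

This reflects the intuition in the text: `c_test_reject(100, 100)` is False while `c_test_reject(100, 10)` is True, and the larger *x*_{1}+*x*_{2} is, the smaller the deviation of *x*_{1}/*x*_{2} from 1 needed for rejection.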

However, we still need to think carefully about how to apply the C-Test to our problem. One method is to directly apply *k* independent C-Tests, one at each of the junctions *y*_{i}. However, that method cannot take advantage of the information from neighboring segments to boost the statistical power at a junction. Therefore, we develop the algorithm IterativeCTest. IterativeCTest performs the traditional C-Test but in multiple iterations. Initially, it directly applies *k* independent C-Tests, one at each of the junctions *y*_{i}. The segments around junctions that are not rejected are merged, and we aggregate the counts from the merged segments to continue to the next round of C-Tests at the remaining unmerged junctions in *Y*. This process is repeated until the algorithm converges. Finally, we use the breaking decisions from IterativeCTest to break the incoming contigs and return the fixed contigs.
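A compact sketch of this iterative loop under our reading of the scheme (junctions whose test fails to reject are merged and their read counts pooled; the C-Test used here is the equal-length conditional binomial form, an assumption on our part):

```python
from math import comb

def c_test_reject(x1, x2, alpha=0.05):
    """Equal-length conditional (binomial) C-Test; True means reject H0."""
    n = x1 + x2
    if n == 0:
        return False
    obs = comb(n, x1)
    return sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= obs) / 2 ** n < alpha

def iterative_c_test(counts, alpha=0.05):
    """counts[i] = number of reads starting in segment s_i. Repeatedly
    merge the first junction whose C-Test does not reject, pooling the
    counts, until every remaining junction is rejected. The surviving
    junctions are reported as true mis-assemblies."""
    segs = list(counts)
    merged = True
    while merged and len(segs) > 1:
        merged = False
        for i in range(len(segs) - 1):
            if not c_test_reject(segs[i], segs[i + 1], alpha):
                segs[i:i + 2] = [segs[i] + segs[i + 1]]
                merged = True
                break
    return segs
```

For example, `iterative_c_test([100, 100, 10, 10])` pools to `[200, 20]`: the two outer pairs merge, and the single remaining junction (a large coverage jump) is kept as a mis-assembly.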

### Merger: merging assembled contigs

After fixing mis-assemblies, we are ready to study another pillar of BIGMAC: Merger. Merger establishes connectivity among contigs and subsequently makes decisions to extend contigs.

#### Graph operator

The goal of Graph Operator is to identify candidates for contig extension. It forms and transforms a string graph using the overlap information among original reads and contigs. It subsequently analyzes the graph to find candidates for contig extension. The overall algorithm is summarized in Algorithm 2.

##### Mapping

We obtain mapping among contigs and reads. This provides the fundamental building block to construct the connectivity relationship among contigs and reads.

##### FormGraph

The goal of FormGraph is to establish connectivity among contigs. We use a contig-read string graph as our primary data structure. The contig-read string graph is a string graph [10] whose nodes are the contigs together with the reads associated with their endpoints. The size of the graph is thus manageable because most reads are not included in the graph. Then, we construct an overlay graph (which we call the contig graph) whose nodes are the contigs, with node weights being the coverage depths of the contigs. In the contig graph, there is an edge between two nodes if there is a sequence of reads between the associated contigs. With these data structures, we can switch between macroscopic and microscopic representations of the contig connectivity easily.

##### GraphSurgery

GraphSurgery simplifies the contig graph. This includes removal of transitive edges and edge contraction.

For nodes *u*,*v*,*w*, if we have edges *u*→*v*, *u*→*w* and *w*→*v*, then we call *u*→*v* a transitive edge. We perform transitive reduction [10] on the contig graph to remove transitive edges. Removing these transitive edges prevents us from finding false repeats in the next stage. To improve robustness, there is a pre-processing step before transitive reduction. If the contig *w* is too short and there are no reads that form a head-to-tail overlap between *w*,*u* or *w*,*v*, then an edge needed for transitive reduction to operate properly may be missing. Thus, we add the missing edge (either from *u* to *w* or from *w* to *v*) back when we find contigs and related reads that suggest the missing edge might be there.
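The transitive-edge rule can be sketched as follows (our simplification: a single-step reduction on an explicit edge list, whereas string-graph implementations [10] work with overlap lengths):

```python
from collections import defaultdict

def remove_transitive_edges(edges):
    """Drop u->v whenever some intermediate w exists with u->w and w->v."""
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
    return [(u, v) for u, v in edges
            if not any(v in succ[w] for w in succ[u] if w != v)]
```

For instance, `remove_transitive_edges([("u", "v"), ("u", "w"), ("w", "v")])` returns `[("u", "w"), ("w", "v")]`; the transitive edge *u*→*v* is removed.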

An edge *u*→*v* is contractable when the outgoing degree of *u* and the incoming degree of *v* are both 1. We contract edges to fill gaps. Our experience with FinisherSC is that data are dropped in the assembler, so reconsidering them in a post-processing step can potentially fill some gaps. However, there is a caveat. Since, to save computational resources, we only use reads close to each contig’s endpoints when establishing connectivity in the contig-read string graph, we may miss some legitimate connections. Therefore, very long repeats can prevent the detection of linkage between contigs, thereby allowing contigs to be erroneously merged. To address that, if there exists a contig *w* that is connected (by some reads) to a region close to the right end of *u*/left end of *v*, then we avoid contracting *u*→*v*.

##### FindExtensionCandidates

FindExtensionCandidates identifies candidates for contig extension by identifying local patterns in the contig graph. One form of extension candidates is a pair of contigs that are connected without competing partners. This corresponds to the contigs joined by a contractable edge. Another form of extension is a set of contigs that are connected with competing partners. This corresponds to the contigs linked together in the presence of repeats. In the contig graph, the repeat interior can either be represented as a separate node or not. If the repeat interior is represented as a separate node, the local subgraph is a star graph with the repeat node at the center. Otherwise, the local subgraph is a bipartite graph consisting of competing contigs. After identifying the contigs associated with a specific repeat, we can then merge contigs in the next stage.

#### Contig extender

After the operations by Graph Operator, we have extracted the potential contig extension candidates from the contig graph. It remains to decide whether and how to merge the corresponding contigs. At a high level, Contig Extender aims at solving the Contig Merging Decision Problem.

### **Contig merging decision problem**

Given a set of incoming contigs *I* and a set of outgoing contigs *O* whose coverage depth and read connectivity information is known, decide how to form an appropriate bipartite matching between *I* and *O*.

While we do not intend to rigorously define the statement of the Contig Merging Decision Problem here, it is clear that an appropriate matching is one that achieves high accuracy. Contig Extender carefully analyzes the read connectivity and contig coverage to make decisions on merging contigs. In the coming discussion, we first study an effective heuristic that captures the essence of the problem. After that, we study how to rigorously define the Contig Merging Decision Problem in a mathematical form and suggest an EM-algorithm for solving it.

##### Heuristic in solving the contig merging decision problem

When the cardinalities of the set of incoming contigs *I* and the set of outgoing contigs *O* are both 1, a natural decision is to merge them. Thus, the focus here is on the case when |*I*|>1 or |*O*|>1. We select the reads that uniquely span one contig *a* in the incoming set and one contig *b* in the outgoing set. If there are multiple such reads, then we decide that *a*,*b* should be joined together, provided that there is no path of reads leading *a* to other contigs in the outgoing set, and similarly for *b*. Moreover, we perform similar tests using contig coverage. If the coverage depths of two contigs are very close, they are declared a potential merge pair. Then, we test whether there are any close runner-ups. If not, they should be merged. To combine the decisions made using spanning reads and coverage depth, we have a subroutine that discards all conflicting merges. We find that this heuristic for decision making works quite well on our datasets. However, in the coming discussion, we will study how to make the extension decisions in a more principled and unified manner.
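The spanning-read part of this heuristic can be sketched as follows (the function and the support threshold are our own illustration; BIGMAC combines this with the coverage test and conflict removal described above):

```python
from collections import Counter

def decide_merges(spanning_reads, min_support=2):
    """spanning_reads: one (incoming_contig, outgoing_contig) pair per
    read that uniquely spans a junction. Merge a pair only when it has
    enough supporting reads and neither side has a competing partner."""
    support = Counter(spanning_reads)
    merges = []
    for (a, b), n in support.items():
        if n < min_support:
            continue
        competing = any((x == a or y == b) and (x, y) != (a, b)
                        for x, y in support)
        if not competing:
            merges.append((a, b))
    return merges
```

With reads `[("a", "b")] * 3 + [("c", "d")] * 2 + [("c", "e")] * 2`, only `("a", "b")` is merged; `c` has two competing outgoing partners, so both of its candidate merges are discarded.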

##### General framework in solving the contig merging decision problem

The challenge of the Contig Merging Decision Problem is the tradeoff among many physical quantities (e.g. abundance, edit distance of reads, noise level, number of spanning reads, etc.). We address this by defining a confidence score based on parameter estimation. For simplicity of discussion, we assume that *k* is the cardinality of both the set of incoming contigs and the set of outgoing contigs. The goal is to find the best perfect matching with respect to a confidence score.

Let *M* be a perfect matching of contigs among the incoming and outgoing groups *I* and *O*. Under *M*, there are *k* merged contigs. Let the observation be the set of related reads *X*={*R*_{i} ∣ 1≤*i*≤*n*}. We define the parameters *θ*={(*λ*_{j}, *I*_{j})}_{1≤*j*≤*k*}, where *λ*_{j} is the normalized abundance of contig *j* and *I*_{j} is the genomic content of contig *j*. Note that \(\sum _{1\le j \le k} \lambda _{j} = 1\). So, the parameter estimation problem can be cast as \(s_{M} = \max_{\theta} P_{\theta}(X \mid M)\), where *s*_{M} is the confidence score of the matching *M*. Finally, the best perfect matching can be found by comparing the corresponding confidence scores.

Note that we have hidden variables *Z*=(*Z*_{i})_{1≤*i*≤*n*} which indicate the contigs that the reads *X* are extracted from (i.e. *Z*_{i}∈{1,2,…,*k*}). If we assume the length of contig *j* to be *ℓ*_{j} and *q* to be the indel noise level (i.e. each location remains unaltered with probability 1−2*q*), then we can use an EM-algorithm to obtain an estimate of *θ*. Specifically, the expected value of the log likelihood function \(E_{q(Z\mid X, \theta ^{(t)})}\left [\log P_{\theta ^{(t+1)}}(X, Z) \right ]\) is

$${} \begin{aligned} \sum_{1\le i \le n} \sum_{1\le j\le k} M^{j}(R_{i}, I^{(t)}) &\left[\log \frac{\lambda_{j}^{(t+1)}}{\ell_{j}} + |R_{i}|\log(1-2q)\right.\\ &\left.+ d(R_{i}, I_{j}^{(t+1)})\log \frac{q}{1-2q} \right] \end{aligned} $$

(4)

where \(M^{j}(R, I^{(t)}) = \delta _{j=\operatorname{argmin}_{j'} d(R, I^{(t)}_{j'})}\) is the assignment of reads to the most similar contig (with ties broken using *λ*^{(t)}), *d*(*A*,*B*) is the best local alignment score, \(I^{(t)} = (I^{(t)}_{j})_{1\le j \le k}\) are the genomic contents of the contigs at iteration *t*, and \(\lambda^{(t)} = (\lambda^{(t)}_{j})_{1\le j \le k}\) are the estimated abundances at iteration *t*. By maximizing the log likelihood function with respect to *θ*^{(t+1)}, we obtain the update formulas

$$ \lambda^{(t+1)}_{j^{*}} = \frac{\sum_{1\le i\le n} M^{j^{*}}(R_{i}, I^{(t)})}{\sum_{1\le j\le k} \sum_{1\le i\le n} M^{j}(R_{i}, I^{(t)})} $$

(5)

$$ I^{(t+1)}_{j^{*}} = \underset{I}{\operatorname{argmin}} \sum_{1\le i \le n} M^{j^{*}}(R_{i}, I^{(t)})d(R_{i}, I) $$

(6)
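The abundance update (Eq. 5) with hard assignments *M*^{j} can be sketched as below. This is our own simplification: contig contents *I*_{j} are kept fixed, and plain edit distance stands in for the local alignment score *d*, so the update of Eq. 6 (which needs multiple sequence alignment) is not attempted:

```python
def em_abundance(reads, contigs):
    """One pass of the lambda update (Eq. 5): hard-assign each read to
    the contig of smallest edit distance (the M^j indicator), then set
    lambda_j to the fraction of reads assigned to contig j."""
    def edit_distance(a, b):
        # standard single-row dynamic program for Levenshtein distance
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                         prev + (ca != cb))
        return dp[-1]

    k = len(contigs)
    counts = [0] * k
    for r in reads:
        counts[min(range(k), key=lambda j: edit_distance(r, contigs[j]))] += 1
    total = sum(counts)
    return [c / total for c in counts]
```

For instance, `em_abundance(["ACGT", "ACGA", "TTTT"], ["ACGT", "TTTT"])` gives abundances [2/3, 1/3]: the first two reads are closest to the first contig.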

Note that the update of \(I_{j^{*}}\) requires multiple sequence alignment. In general, the problem is NP-hard [11]. However, for noisy reads extracted from several contigs, the problem is not as difficult. For example, in the case of pure substitution noise, an efficient optimal solution is a column-wise majority vote. Despite the elegance and feasibility of this approach, it is still computationally more intensive than the heuristic. Therefore, in our implementation of BIGMAC, the heuristic is the default method used in Contig Extender. But we also provide an experimental version of the EM-algorithm, which can be used when users specify --option emalgo=True in their commands.