 Research Article
 Open Access
 Published:
BIGMAC : breaking inaccurate genomes and merging assembled contigs for long read metagenomic assembly
BMC Bioinformatics volume 17, Article number: 435 (2016)
Abstract
Background
The problem of denovo assembly for metagenomes using only long reads is gaining attention. We study whether postprocessing metagenomic assemblies with the original input long reads can result in quality improvement. Previous approaches have focused on preprocessing reads and optimizing assemblers. BIGMAC takes an alternative perspective to focus on the postprocessing step.
Results
Using both the assembled contigs and original long reads as input, BIGMAC first breaks the contigs at potentially misassembled locations and subsequently scaffolds contigs. Our experiments on metagenomes assembled from long reads show that BIGMAC can improve assembly quality by reducing the number of misassemblies while maintaining or increasing N50 and N75. Moreover, BIGMAC shows the largest N75 to number of misassemblies ratio on all tested datasets when compared to other postprocessing tools.
Conclusions
BIGMAC demonstrates the effectiveness of the postprocessing approach in improving the quality of metagenomic assemblies.
Background
Introduction
Denovo assembly is a fundamental yet difficult [1] computational problem in metagenomics. Indeed, there is currently an open challenge for metagenomic assembly using short reads, titled “Critical Assessment of Metagenomic Interpretation (CAMI [2]).” On the other hand, emerging sequencing technologies can provide extra information and make the computational problem more tractable. For example, long reads are increasingly being shown to be valuable in the denovo assembly of single genomes [3]. Therefore, it is natural to study metagenomic assembly using long reads. Current assembly approaches for long reads focus on developing more optimized assemblers to leverage the power of the data. However, significant engineering effort is usually involved to build an endtoend assembler.
We take a different perspective, focusing the design effort on a postprocessor that improves assembled contigs using the original long read data (Fig. 1). The main question is whether we can achieve quality improvement with this approach using typical long reads. This postpocessing approach is attractive because it leverages existing software. Consequently, we can focus design effort and computational resources to specifically address thorny issues arising from the nature of new data, complex repeats, varying abundances and noise. Moreover, since the long reads are part of the input, the postprocessor can bring back information missed upstream. In single genome assembly, FinisherSC [4] has demonstrated the effectiveness of this approach. In this paper, we investigate the effectiveness of this postprocessing approach for metagenomic assembly.
BIGMAC is a postprocessor to improve metagenomic assemblies with long read only data. It first breaks contigs at potentially misassembled locations and subsequently scaffolds contigs. In our experiments, BIGMAC demonstrates promising results on several mock communities using data from the Pacific Biosciences long read sequencer. Inputs to BIGMAC include assembled contigs from HGAP [5] and the original raw reads with adapters removed. After assembly and postprocessing, we use QUAST [6] to evaluate and compare the assembly quality of contigs before and after using BIGMAC. As shown in Fig. 1, BIGMAC improves the quality of the contigs by reducing the number of misassemblies while maintaining/increasing N50 and N75. This demonstrates the effectiveness of the postprocessing approach for metagenomic assembly with long reads.
A topdown design of BIGMAC
We use a hypothetical yet representative set of input data to illustrate the design of BIGMAC in a topdown manner. Let g _{1},g _{2} be two genomes of abundances λ _{1},λ _{2} respectively. Assume that they share a long repeat in the middle, that is, g _{1}=x _{1} r y _{1},g _{2}=x _{2} r y _{2}. Unfortunately, an upstream assembler misassembles the reads and produces two contigs c _{1},c _{2} with incorrect joins at the repeat. That is, c _{1}=x _{1} r y _{2},c _{2}=x _{2} r y _{1}. As an assembly postprocessor, BIGMAC takes the misassembled contigs c _{1},c _{2} and original reads as input. Its goal is to reproduce g _{1},g _{2}. To achieve this, we immediately recognize that there should be components for fixing misassemblies and scaffolding contigs. In BIGMAC, they are respectively Breaker and Merger. An illustration is given in Fig. 2.
Breaker is further divided into two subcomponents: Signal Detector and Signal Aggregator. Signal Detector collects signals that indicates misassemblies and Signal Aggregator subsequently makes decisions on breaking contigs based on the collected signals. In our example, the coverage fluctuation between λ _{1},λ _{2} along the contigs c _{1},c _{2} and the presence of long repeat r are useful signals that Signal Detector collects. After aggregating these signals, Signal Aggregator decides on breaking both the contigs c _{1} and c _{2} at the starting points of the repeat r. Therefore, the output contigs of Breaker are x _{1},x _{2},r y _{1},r y _{2}.
Merger is also divided into two subcomponents: Graph Operator and Contig Extender. With information from the original reads, Graph Operator establishes connectivity of the input contigs using string graphs. Then, based on the evidence from spanning reads and contig coverage, Contig Extender extends input contigs into longer contigs. In our example, the input contigs to Merger are x _{1},x _{2},r y _{1},r y _{2}. Graph Operator forms a string graph with edges x _{1}→r y _{1},x _{1}→r y _{2},x _{2}→r y _{1} and x _{2}→r y _{2}. Contig Extender analyzes the coverage depth of the related contigs and decides to merge contigs to form x _{1} r y _{1} and x _{2} r y _{2}, thus reproducing the correct genomes.
Finally, we remark that BIGMAC uses overlap information between reads and contigs or among contigs. In our implementation, the overlap information is provided by running MUMmer [7] on appropriate strings. However, we note that one can implement the pipeline by replacing MUMmer with other aligners.
Methods
Breaker: breaking inaccurate genome
After the functional decomposition of BIGMAC in the previous section, we are ready to investigate its first component: Breaker. We note that the goal of Breaker is to fix misassemblies. In order to achieve that, we need to collect sensible signals related to misassemblies and subsequently aggregate the signals to make the contig breaking decisions.
Signal Detector
Signal Detector collects three important signals related to misassemblies.
Palindrome We are interested in palindromes because of their relationship to a form of chimeric reads, the adaptorskipped reads, which are common in today’s long read technology [8]. Since assemblers get stuck at these chimeric reads, the palindrome pattern in reads propagates to the corresponding contigs. Thus, the pattern of palindrome is a strong signal indicating misassemblies, particularly when the palindrome is long. A string tuple (a,b) is called a wrapping pair if the reverse complement of a is a prefix of b or the reverse complement of b is a suffix of a. x is called a palindrome if it is the concatenation of a wrapping pair (a,b), that is x=a b. The wrapping length of x is maxx=a b,(a,b)is a wrapping pair min(a.l e n g t h,b.l e n g t h). For example, x=A C G G C C G is a palindrome of wrapping length 3; (a,b)=(A C G G,C C G) is a wrapping pair because the reverse complement of b is CGG, which is a suffix of a. Since the minimum length of a and b is min(4,3)=3 and the wrapping length of x cannot exceed 3, the wrapping length for x is 3.
Signal Detector collects information about palindromes by aligning each contig against itself. It then identifies palindromes with long wrapping length because that indicates misassemblies. The corresponding palindromes’ information is then put into S _{ palindrome }. To improve sensitivity, Signal Detector allows approximate match when searching for palindromes. We note that approximate matches are determined by computing the edit distance of the corresponding strings. To determine approximate matches in BIGMAC, we use nucmer in MUMmer [7] with default parameters and with option –maxmatch. Two strings are considered as approximately matched when nucmer reports so.
Repeat Since long repeats confuse assemblers, their endpoints are possible candidates for misassemblies. Let s _{1}=a x b,s _{2}=c x d, a repeat between s _{1},s _{2} is specified by the endpoints of x in s _{1},s _{2}. For example, s _{1}=C A A A A T,s _{2}=G A A A A G, the endpoints of the repeat AAAA are the position specified by ! in C!A A A A!T,G!A A A A!G. Signal Detector aligns contigs against themselves to find the repeats. It then marks down the positions of the endpoints and puts them in a set called S _{ repeat }. To improve sensitivity, Signal Detector allows approximate matches when searching for repeats. We note that approximate matches are determined by computing the edit distance of the corresponding strings. Moreover, it only considers repeats that are maximal and are of significant length.
Coverage Significant coverage variation along contigs can correspond to false joins of sequences from different genomes with different abundances. Coverage profile is the coverage depth along the contigs. For example, the coverage profile along a string s=A C G T is (1,2,2,1) if the reads are A C,C G,G T. Signal detector aligns original reads to the contigs to find the coverage profile, which is called S _{ coverage }.
Signal Aggregator
After Signal Detector collects signals regarding palindromes S _{ palindrome }, repeats S _{ repeat } and coverage profile S _{ coverage }, Signal Aggregator uses them to determine the breakpoints on the input contigs C. The algorithm is summarized in Algorithm 1.
ChimericContigFixing The goal of ChimericContigFixing is to fix the contigs formed from chimeric reads. We simply break the palindromes in S _{ palindrome } at locations corresponding to their wrapping lengths. After removing redundant contigs, ChimericContigFixing returns the broken palindromes with the unchanged contigs.
LocatePotentialMisassemblies The goal of the subroutine LocatePotentialMisassemblies is to locate potential misassemblies caused by repeats. We study the design of this subroutine in this section.
Motivating question and example We can declare all the endpoints of approximate repeats in S _{ repeat } to be potential misassemblies. While this is a sensible baseline algorithm, it is not immediately clear whether it is sufficient or necessary. It is thus natural to consider the following question.
Given a set of contigs, how can we find the smallest set of locations on contigs to break so that the broken contigs are consistent with any reasonable ground truth? To illustrate our ideas, we consider an example with contigs x _{1}=a b c d e,x _{2}=f b c g,x _{3}=h c d i with {a,b,c,d,e,f,g,h,i} being strings of equal length L.
The baseline algorithm of breaking contigs at the starting points of all the long(≥2L) repeats breaks the contigs 4 times (i.e. abc d e,fb c g,hc d i). However, interestingly, we will show that one only need to break the contigs 3 times to preserve consistency (i.e. x _{1}=a bc d e,x _{2}=f bc g,x _{3}=hc d i) and that is optimal.
Modelling and problem formulation We will formalize the notions of feasible break points, feasible ground truth, consistency between sets of contigs, sufficiency of break points to achieve consistency and the optimiality criterion.
We use a graph theoretic framework. Specifically, we study a directed graph G=(V,E) with m sources S and m sinks T where ∀v∉S∪T,i n d e g(v)=o u t d e g(v) and parallel edges between two vertices are allowed. This is used to model a fully contracted De Bruijn graph formed by successive Kmers of the contigs. Vertices V are substrings of the contigs and edges E correspond to potentially misassembled locations on contigs. In our example, the set of vertices is V={a,b,c,d,e,f,g,h,i} and the set of edges is E={a b,f b,b c _{1},b c _{2},h c,c d _{1},c d _{2},c g,d e,d i}. We use subscripts to differentiate parallel edges joining the same vertices. The graph corresponding to our running example is shown in Fig. 3.
Given such a graph G, we note that E is the set of all feasible break points because each edge in the graph corresponds to a potentially misassembled location on contigs. A feasible ground truth corresponds to a set of m edgedisjoint sourcetosink trails that partitions the edge set E. For simplicity, we represent a trail as a sequence of the vertices in G, where the edges linking the vertices are ignored. For example, {a b c d e,f b c d i,h c g} is a feasible ground truth while {a b c g,f g d e,h c d i} is another feasible ground truth. The set of all feasible ground truths is GT.
We recall that our high level goal is to choose a set of feasible break points R⊆E so that the broken contigs are consistent with any feasible ground truth. So, we need to define the notion of broken contigs and consistency between two sets of contigs under the current framework. Let g t∈G T, we denote g t∖R be a set of trails after the removal of the edge set R. In particular, for the original contig set C∈G T, C∖R is the set of broken contigs for the set of feasible break points R. For example, if R={b c _{1},b c _{2},h c} and C={a b c d e,f b c d i,h c g}, C∖R={a b,c d e,f b,c d i,h,c g}. To capture consistency between two sets of contigs, we use the following definition. Given two sets of trails T _{1},T _{2}, we say that T _{1} is consistent with T _{2} if ∀x∈T _{1},∃y∈T _{2} s.t. x⊆y and ∀x ^{′}∈T _{2},∃y ^{′}∈T _{1} s.t. x ^{′}⊆y ^{′}. We call R a sufficient breaking set with respect to (C,G T) if ∀g t∈G T,C∖R is consistent with g t∖R. In other words, R is a set of feasible break points that allows the broken contigs to be consistent with any feasible ground truth. Although this notion of sufficient breaking set is a natural model of the problem, it depends on the original contig set C, which is indeed not necessary. Specifically, we show that we have an equivalent definition of sufficient breaking set without referring back to the original contig set. Let us call R a sufficient breaking set with respect to G, or simply a sufficient breaking set, if ∀g t _{1},g t _{2}∈G T,g t _{1}∖R is consistent with g t _{2}∖R.
Proposition 1
R is a sufficient breaking set with respect to (C,G T) if and only if R a sufficient breaking set with respect to G.
Proof
The backward direction is immediate because C∈G T. We will show the forward direction as follows. Let g _{1},g _{2}∈G T and we want to show that g _{1}∖R is consistent with g _{2}∖R. Since R is a sufficient breaking set with respect to (C,G T), g _{1}∖R is consistent with C∖R. Therefore, ∀x∈g _{1}∖R∃y∈C∖R s.t. x⊆y. But since g _{2}∖R is consistent with C∖R, we have ∃z∈g _{2}∖R s.t. y⊆z. By transitivity, we have x⊆y⊆z∈g _{2}∖R. By symmetry, we can also show that ∀x ^{′}∈g _{2}∖R∃y ^{′}∈g _{1}∖R s.t. x ^{′}⊆y ^{′}. Thus, g _{1}∖R is consistent with g _{2}∖R. □
Now, we state our optimization criterion. The goal here is to minimize the cardinality of the sufficient breaking set, formally as Eq. 1.
We will show that if we solve Eq. 1 for our running example, the answer is 3. This thus shows that the baseline algorithm of breaking contigs at all starting points (in our example, there are 4 of them) of all long repeats is not optimal.
Proposition 2
For our running example, OPT = 3.
Proof
First, by inspecting the 6 feasible ground truths in GT, we note R={b c _{1},b c _{2},h c} is a sufficient breaking set with respect to G. Second, we note that the edge set need to disconnect sources and sinks, otherwise, we can find g _{1},g _{2}∈G T such that g _{1}∖R,g _{2}∖R are inconsistent. This means R need to be ≥ mincut of the graph, which is 3. □
Algorithm development and performance guarantee Next we describe an algorithm that finds a sufficient breaking set with respect to G. Let us denote a boolean variable b _{ e } on each edge e∈E, with \(\vec {b} = (b_{e})_{e\in E}\). For v∈V, I n E d g e s(v),O u t E d g e s(v) are the sets of incoming edges and outgoing edges at v respectively. P r e v(v),S u c c(v) are the sets of predecessor vertices and successor vertices of v respectively. Our algorithm solves the following minimization problem (Eq. 2) as a proxy to (Eq. 1).
where,
In other words, it includes either all the incoming edges or all the outgoing edges for every vertices with at least 2 successors and at least 2 predecessors to R. We then seek R with minimum cardinality among the choices available. If G can be decomposed into connected components, we can optimize \(\Phi (\vec {b})\) independently on each connected component. In our implementation, if the size of the connected component is not too large, we optimize the objective function by exhaustion. Beyond a certain threshold, we simply output a feasible solution without optimizing. Indeed, in our experiments on real data, most of the connected components are not that large and this step typically takes a few minutes. But we remark that for more complex applications, one can further optimize the algorithm. For example, one can first topologically sort the vertices and then use dynamic programming to solve Eq. 2 along the sorted vertices.
We note that the algorithm described gives an optimal solution for our running example. Moreover, we show performance guarantee of the algorithm as follows.
Proposition 3
Solving Eq. 2 gives a sufficient breaking set R if the graph G is fully contracted.
Proof
Let R be the set of edges selected by the algorithm. For any two g t _{1},g t _{2}∈G T, we want to show that g t _{1}∖R and g t _{2}∖R are consistent. By symmetry, it suffices to prove that if x∈g t _{1}∖R, then ∃y∈g t _{2}∖R s.t. x⊆y. If x=2, it is immediate because every edge other than those in R are covered. If x≥3, we will show that it is also true using proof by contradiction. If ∀y∈g t _{2}∖R,x⫅̸y, we can find a subtrail x ^{′}=(a _{1},a _{2},…,a _{ k },a _{ k+1}) of x such that ∃y ^{′}∈g t _{2}∖R s.t. x ^{″}=(a _{1},…,a _{ k })⊆y ^{′} but ∀y∈g t _{2}∖R,x ^{′}⫅̸y. This implies ∃a ^{∗}≠a _{ k+1} s.t. (x ^{″},a ^{∗})⊆z for some z∈g t _{2}∖R. Since the edge (a _{ k },a _{ k+1})⊆x∈g t _{1}∖R, we know that (a _{ k },a _{ k+1}) is not in R. Similarly, (a _{ k },a ^{∗})∉R because (a _{ k },a ^{∗})⊆y ^{′}∈g t _{2}∖R. But since S u c c(a _{ k })≥2, the fact that our algorithm does not include (a _{ k },a ^{∗}),(a _{ k },a _{ k+1}) in R implies that P r e d(a _{ k })=1, namely P r e d(a _{ k })={a _{ k−1}}. Note that we are considering a fully contracted graph. So, the fact that a _{ k−1} exists implies that S u c c(a _{ k−1})≥2. But our algorithm has not removed edge (a _{ k−1},a _{ k }). This gives P r e d(a _{ k−1})=1. Inductively, we get P r e d(a _{ i })=1∀2≤i≤k. But we know that (a _{ k },a _{ k+1})⊆w for some w∈g t _{2}∖R. Since the edges along (a _{1},…,a _{ k+1}) are not in R, this gives, x ^{′}=(a _{1},…,a _{ k+1})⊆w∈g t _{2}∖R. This contradicts the assumption about x ^{′}. □
Proposition 4
If the graph G is fully contracted DAG without parallel edges, then solving Eq. 2 returns a sufficient breaking set that is of optimal cardinality, OPT.
Proof
It suffices to show that for any sufficient breaking set R, ∀v∈V where S u c c(v)≥2,P r e d(v)≥2, we have either I n E d g e s(v)⊆R or O u t E d g e s(v)⊆R. We prove it by contradiction. If not, ∃v∈V where S u c c(v)≥2,P r e d(v)≥2 but I n E d g e s(v)⫅̸R and O u t E d g e s(v)⫅̸R. Because it is a DAG, it means we can find g t _{1}∈G T such that ∃x,y,x ^{′},y ^{′} such that (x,v,y)∈g t _{1} and (x ^{′},v,y ^{′})∈g t _{1}. Now consider g t _{2} to be a clone of g t _{1} except that it has (x,v,y ^{′}),(x ^{′},v,y) instead of (x,v,y ^{′}),(x ^{′},v,y). We note that g t _{2}∈G T. And because there are no parallel edges, (x,v,y) is not a subset of any of the trails in g t _{2}. So, we find two distinct g t _{1},g t _{2}∈G T such that g t _{1},g t _{2} are not consistent. This contradicts the fact that R is a sufficient breaking set. □
It is noteworthy that if we are given any set of contigs from any g t∈G T, we still obtain the same set of broken contigs after the operation of removal of redundant trails, RemoveRedundant (i.e. we eliminate the contigs in a set A to form a minimal subset B⊆A in which ∀x≠y∈B,x⫅̸y). This can be formalized as follows.
Proposition 5
If R is a sufficient breaking set, then for any g t _{1},g t _{2}∈G T,R e m o v e R e d u n d a n t(g t _{1}∖R)=R e m o v e R e d u n d a n t(g t _{2}∖R).
Proof
Let s _{ i }=R e m o v e R e d u n d a n t(g t _{ i }∖R) for i∈{1,2}. By symmetry, it suffices to prove that s _{1}⊆s _{2}∀x∈s _{1}⊆g t _{1}∖R,∃y∈g t _{2}∖R, such that x⊆y. Note that we can also find some x ^{∗}∈s _{2} such that y⊆x ^{∗}. This gives x⊆y⊆x ^{∗}. However, since we have no redundant trails in s _{1}, we get x=x ^{∗}. Thus x=x ^{∗}∈s _{2}. □
To apply BIGMAC to real data, we have to implement the described algorithm with some further engineering. This includes methods to tolerate noise, to handle double stranded nature of DNA, and to construct De Bruijn graph from the repeats. Interestd readers can refer to the Additional file 1 : Appendix for these implementation details.
ConfirmBreakPoints The goal of ConfirmBreakPoints is to utilize the coverage profile S _{ coverage } to confirm breaking decisions at potentially misassembled locations specified in S _{ filter }. For the sake of simplicity, we now consider a string s of length L, and a set of positions Y={y _{ i }}_{1≤i≤k } of s which are potential misassemblies. Furthermore, let us assume that all misassemblies are caused by mixing genomes of different abundances. Using Y, we can partition s into segments {s _{ i }}_{0≤i≤k } of lengths respectively as {ℓ _{ i }}_{0≤i≤k }. We let x _{ i } be the number of reads that start in segment s _{ i }, which can be estimated from S _{ coverage }. The question is how to classify the points in Y as true misassemblies or as fake misassemblies.
One can use heuristics, like comparing coverage depth difference before and after each y _{ i }. However, this is not ideal. For example, if we have coverage depth before and after y _{1} differing by 1X, we would expect it to be a much stronger signal for true misassembly if the lengths ℓ _{0},ℓ _{1} are of order of 100 K rather than of 100 and this cannot be seen by considering coverage depth difference alone. For the case of just two segments of equal length and if we assume the x _{ i }’s are independent Poisson random variables, there is a popular statistical test, CTest [9], that can make the decision. Formally, if x _{1}∼P o i s s o n(m _{1}) and x _{2}∼P o i s s o n(m _{2}), then CTest is a test to decide between the hypothesis of H _{0}:m _{1}=m _{2} against H _{1}:m _{1}≠m _{2}. CTest first considers x _{1}+x _{2} to compute the decision boundary and it then decides whether to reject H _{0} based on x _{1}/x _{2} and the previously derived decision boundary. The intuition is that x _{1}+x _{2} corresponds to the amount of data, which determines the confidence of a statistical test. Thus, if x _{1}+x _{2} is large, a small perturbation of x _{1}/x _{2} from 1 can still be very significant and can correspond to a confident rejection of H _{0}.
However, we still need to think carefully about how to apply CTest to our problem. One method is to directly apply k independent CTest on each of the junctions y _{ i }. However, that method cannot take advantage of the information from neighboring segments to boost the statistical power at a junction. Therefore, we develop the algorithm IterativeCTest. IterativeCTest performs traditional CTest but in multiple iterations. Initially, it directly applies k independent CTest on each of the junctions y _{ i }. Some of the segments are merged and we aggregate the counts from the merged segments to continue to the next CTest at the remaining unmerged junctions in Y. This process is repeated until the algorithm converges. Finally, we use the breaking decisions from IterativeCTest to break the incoming contigs and return the fixed contigs.
Merger: merging assembled contigs
After fixing misassemblies, we are ready to study another pillar of BIGMAC: Merger. Merger establishes connectivity among contigs and subsequently makes decisions to extend contigs.
Graph operator
The goal of Graph Operator is to identify candidates for contig extension. It forms and transforms a string graph using the overlap information among original reads and contigs. It subsequently analyzes the graph to find candidates for contig extension. The overall algorithm is summarized in Algorithm 2.
Mapping
We obtain mapping among contigs and reads. This provides the fundamental building block to construct the connectivity relationship among contigs and reads.
FormGraph
The goal of FormGraph is to establish connectivity among contigs. We use contigread string graph as our primary data structure. Contigread string graph is a string graph [10] with both the contigs and the reads associated with their endpoints as nodes. The size of the graph is thus manageable because most reads are not included in the graph. Then, we construct an overlay graph (which we call the contig graph) such that the nodes are the contigs with weights on nodes being the coverage depth of contigs. In the contig graph, there is an edge between two nodes if there is a sequence of reads between the associated contigs. With these data structures, we can switch between macroscopic and microscopic representation of the contig connectivity easily.
GraphSurgery
GraphSurgery simplifies the contig graph. This includes removal of transitive edge and edge contraction.
For nodes u,v,w, if we have edges u→v,u→w and w→v, then we call u→v a transitive edge. We perform transitive reduction [10] on the contig graph to remove transitive edges. Removing these transitive edges prevents us from finding false repeats in the next stage. To improve robustness, there is a preprocessing step before transitive reduction. If the contig w is too short and there are no reads that form headtotail overlap between w,u or w,v, then we can have a missing edge for transitive reduction to operate properly. Thus, we add the missing edge (either from u to w or w to v) back when we find contigs and related reads that suggest the missing edge might be there.
An edge u→v is contractable when the outgoing degree of u and the incoming degree of v are both 1. We contract edges to fill gaps. Our experience with FinisherSC is that data are dropped in the assembler and so reconsidering them as a postprocessing step can potentially fill some gaps. However, there is a caveat. In establishing connectivity in contigread string graph, we only use reads close to each contig’s endpoints (as a way to save computation resources), we may miss some legitimate connections. Therefore, very long repeats prevent the detection of linkage of contigs, thereby allow contigs to be erroneously merged. To address that, if there exists contig w that is connected (by some reads) to a region close to the right end of u/left end of v, then we avoid contraction of u→v.
FindExtensionCandidates
FindExtensionCandidates identifies candidates for contig extension by identifying local patterns in the contig graph. One form of extension candidates is a pair of contigs that are connected without competing partners. This corresponds to the contigs joined by a contractable edge. Another form of extension is a set of contigs that are connected with competing partners. This corresponds to the contigs linked together in the presence of repeats. In the contig graph, the repeat interior can either be represented as a separate node or not. If the repeat interior is represented as a separate node, the local subgraph is a star graph with the repeat node at the center. Otherwise, the local subgraph is a bipartite graph consisting of competing contigs. After identifying the contigs associated with a specific repeat, we can then merge contigs in the next stage.
Contig extender
After the operations by Graph Operator, we have extracted the potential contig extension candidates from the contig graph. It remains to decide whether and how to merge the corresponding contigs. In a high level, Contig Extender aims at solving the Contig Merging Decision Problem.
Contig merging decision problem
Given a set of incoming contigs I and a set of outgoing contigs O whose coverage depth and read connectivity information is known. Decide how to form an appropriate bipartite matching between I and O.
While we do not intend to rigorously define the statement of Contig Merging Decision Problem now, it is clear that appropriate matching corresponds to one that achieves high accuracy. Contig Extender carefully analyzes the read connectivity and contig coverage to make decisions on merging contigs. In the coming discussion, we first study an effective heuristic that captures the essence of the problem. After that, we will study how to rigorously define the Contig Merging Decision Problem in a mathematical form and suggest an EMalgorithm in solving that.
Heuristic in solving the contig merging decision problem
When the cardinality of the set of incoming contigs I and the set of outgoing contigs O are both 1, a natural decision is to merge them. Thus, the focus here is to study the case when I>1 or O>1. We select the reads that uniquely span one contig a in the incoming set and one contig b in the outgoing set. If there are multiple such reads, then we decide that a,b should be joined together provided that there does not exist any paths of reads that lead a to other contigs in the outgoing set and similarly for b. Moreover, we perform similar tests using contig coverage. If the coverage depth between two contigs is very close, they will be declared to be a potential merge pair. Then, we test whether there are any close runnerups. If not, they should be merged. To combine the decisions made using spanning reads and coverage depth, we have a subroutine that discards all conflicting merges. We find that this heuristic for decision making works quite well in our datasets. However, in the coming discussion, we will study how to make the extension decisions in a more principled and unified manner.
General framework in solving the contig merging decision problem
The challenge for the Contig Merging Decision Problem is the tradeoff for many physical quantities (e.g. abundance, edit distance of reads, noise level, number of spanning reads, etc). We address this by defining a confidence score based on parameter estimation. For simplicity of discussion, we assume that k is the cardinality of both the set of incoming contigs and the set of outgoing contigs. The goal is to find the best perfect matching with respect to a confidence score.
Let M be a perfect matching of contigs among incoming and outgoing groups I and O. Under M, there are k merged contigs. Let the observation be the set of related reads X={R _{ i }∣1≤i≤n}. We define the parameters θ={(λ _{ j },I _{ j })_{1≤j≤k }}, where λ _{ j } is the normalized abundance of contig j and I _{ j } is genomic content of the contig j. Note that \(\sum _{1\le j \le k} \lambda _{j} = 1\). So, the parameter estimation problem can be cast as s _{ M }= maxθ P _{ θ }(X∣M), where s _{ M } is the confidence score of the matching M. Finally, the best perfect matching can be found by comparing the corresponding confidence score.
Note that we have hidden variables Z=(Z _{ i })_{1≤i≤n } which indicate the contigs that reads X are extracted from (i.e. Z _{ i }∈{1,2,…,k}). If we assume the length of the contig j to be ℓ _{ j } and q to be the indel noise level (i.e. probability of 1−2q to be remained unaltered at each location), then we can use an EMalgorithm to obtain an estimate of θ. Specifically, the expected value of the log likelihood function \(E_{q(Z\mid X, \theta ^{(t)})}\left [\log P_{\theta ^{(t)}}(X, Z, \theta ^{(t+1)} \right ]\) is
where \(M^{j}(R, I^{(t)}) = \delta _{j=argmin_{j} d(R, I^{(t)}_{j})}\) is the assignment of reads to the most similar contig (with tie breaking using λ ^{(t)}), d(A,B) is the best local alignment score, \(I^{(t)} = (I^{(t)}_{j})_{1\le j \le k}\) are the genomic contents of the contigs at iteration t and λ ^{(t)}=(λ _{ j })_{1≤j≤k } are the estimated abundances at iteration t. By maximizing the log likelihood function with respect to θ ^{(t+1)}, we have the update formulas as
Note that the update of \(\phantom {\dot {i}\!}I_{j^{*}}\) requires multiple sequence alignment. In general, the problem is NPhard [11]. However, for noisy reads extracted from several contigs, the problem is not as difficult. For example, in the case of pure substitution noise, an efficient optimal solution is a columnwise majority vote. Despite the elegance and feasibility of this approach, it is still computationally more intense than the heuristic. Therefore, in our implementation of BIGMAC, the heuristic is the default method used in Contig Extender. But we also provide an experimental version of the EMalgorithm which can be used when users specify –option emalgo=True in their commands.
Results
Endtoend experiments
Synthetic data
We verify that BIGMAC correctly improves the incoming contigs using the following synthetic data. We generate two synthetic species of different abundances \((\frac {2}{7}, \frac {5}{7})\). They are composed of random nucleotide sequences of length 5 M each and share a common segment of length 12 K. We uniformly sample indel noise corrupted reads of length 6 K from the genomes, with coverage depth of 20X and 50X respectively. We also artificially construct misassembled contigs by switching the genome segments at the 12 K repeat.
The contigs and the reads are passed through BIGMAC. BIGMAC breaks the contigs at the misassembled point and joins them back correctly. One can see the sample run on [12]. This is also the example that we discuss in the topdown design of BIGMAC.
Real data
We test the performance of BIGMAC in improving metagenomic assembly on PacBio only data. We use different datasets of different characteristics. Dataset 1 consists of a mock community of 5 species [13], with genomes of high similarity. Dataset 2 consists of a mock community of 23 species [14], with genomes of diverse abundances. We also remark that we have tested BIGMAC on a third PacBio only dataset (Dataset 3): a real experiment involving Desulfuromonas biwabikus, D. soudanensis and some other unknown genomes. We note that we know the complete ground truth for the metagenomes in both Dataset 1 and 2 but only know part of the ground truth for Dataset 3. We take the output of HGAP and use the raw reads to improve them using BIGMAC. We observe that in these datasets, BIGMAC reduces the number of misassemblies while maintaining/increasing N50 and N75. The results of BIGMAC is summarized in Table 1, where the quantity of misassemblies is obtained from the QUAST reports. By default, QUAST’s statistics are based on contigs of size >= 500 bp. Interested readers can refer to the Additional file 1: Appendix for more details of the assessment. The data is provided in our online distribution and users can reproduce the results by issuing a single command python reproduce.py.
Comparison with other postprocessing tools
Synthetic data
We use the synthetic data in “Synthetic data” section to evaluate and benchmark BIGMAC, FinisherSC [4], SSPACE_LongRead [15] and PBJelly [16]. BIGMAC is the only tool among the tested tools that fix the misassembled contigs and merge them back correctly. Other tested tools output the same misassembled contigs.
Real data
We perform endtoend testing to compare performance of different tools. The comparison is shown in Table 2. BIGMAC shows the largest N75/# Misassemblies on all three datasets and largest N50/# Misassemblies on two out of three datasets. Indeed, in the only dataset that BIGMAC does not have the largest N50/#Misassemblies, the number of contigs(i.e. L50) beyond N50 is 7. Due to the small number of contigs, this particular measurement on that dataset may not be significant. We also remark that BIGMAC is the only tool that improves contiguity (N50 and N75) and the number of misassemblies.
Simulation and testing on independent components
We perform microbenchmarking on individual components of BIGMAC. The settings and results are summarized as follows.
Analysis of LocatePotentialMisassemblies
We test our break point finding algorithm used in LocatePotentialMisassemblies of Breaker on the synthetic data of x _{1}=a b c d e,x _{2}=f b c g,x _{3}=h c d i discussed in the previous section. The algorithm succeeds in identifying the minimum number of break points as 3. Also, it succeeds in identifying the minimum number of break points as 2 in the presence of double stranded DNA, in the test case of x _{1}=a b c d,x _{2}=e c ^{′} b ^{′} f, where b ^{′},c ^{′} are the reverse complement of b,c respectively.
Analysis of ConfirmBreakPoints
We test IterativeCTest used in ConfirmBreakPoints of Breaker on synthetic data. We simulate misassemblies and variation on abundances by generating a sequence of Poisson random variables and compare the performance of the algorithms on the simulated data as follows. We generate a sequence of 100 numbers by 100 independent Poisson random variables. The Poisson random variables have parameters of either 20 or 50. To select the parameters, we simulate 100 steps of a twostates Markov chain with transition probability of 0.1. We then evaluate the performance of CTest and IterativeCTest on finding the true transition points, which correspond to the junctions of misassemblies. Taking average from 100 rounds of simulation, the recall of both CTest and IterativeCTest are 0.93, meaning that they both can identify almost all transition points. CTest has precision of 0.75 while the precision of IterativeCTest is of 0.87, meaning that IterativeCTest can avoid more fake misassemblies.
Analysis of merger
We compare Merger with other tools on synthetic data as follows. We use a synthetic contig set {x _{1},x _{2},r,y _{1},y _{2}} where the ground truth genomes are (x _{1},r,y _{1}),(x _{2},r,y _{2}). The coverage depth of (x _{1},y _{1}) and (x _{2},y _{2}) are 20X and 50X respectively. We pass the reads together with the contig set to FinisherSC, PBJelly, SSPACE_LongRead to perform scaffolding. We note that BIGMAC is the only tool the can scaffold the contigs correctly into 2 contigs by using the abundance information among the tested tools. Other tools either do not change the input or simply merge r with some of {x _{1},x _{2},y _{1},y _{2}}, resulting in 4 contigs.
Runtime of BIGMAC
The runtime of BIGMAC is summarized in Table 3. We use 20 threads to run the tool on a server computer. The server computer is equipped with 64 AMD Opteron(tm) Processor 6380(8 cores) with frequency 2500 MHz and 362 GB RAM. We note that the majority of the time is spent on alignment of contigs and reads by MUMmer.
Discussion
There are two main implications from the experiments performed. First, we show that postprocessing assemblies is a feasible alternative in improving assembly quality to building another assembler from scratch. This is demonstrated by BIGMAC showing simultanous improvement in terms of number of misassembly and contiguity. We note that this is a nontrivial feature because all other tested tools fail to achieve it. Second, BIGMAC is competitive when compared to the existing postprocessing tools. In particular, it shows better N75/# Misassemblies than all other tested tools in all tested datasets. Moreover, BIGMAC is also the only tool that can handle abundance information, which makes it an attractive candidate for improving metagenomic assembly.
Future work
We remark that the creation of BIGMAC sheds light on many interesting future direction. For example, it would be interesting to apply similar ideas to hybrid data, which has a lot of potential in the context of metagenomics. Moreover, it would also be useful to try BIGMAC and other postprocessing methodology on more real data to better characterize the approach.
Conclusion
We study an agile approach in developing de novo metagenomic assemblers: postprocessing metagenomic assemblies with original input data. BIGMAC demonstrates the effectiveness of the postprocessing approach in improving the quality of metagenomic assemblies using long reads. BIGMAC thus serves as an example that developing postprocessors is a good alternative to building endtoend assemblers, which typically involves more engineering efforts.
References
Chen K, Pachter L. Bioinformatics for wholegenome shotgun sequencing of microbial communities. PLoS Comput Biol. 2005; 1(2):106–12.
The Critical Assessment of Metagenome Interpretation (CAMI) competition. http://blogs.nature.com/methagora/2014/06/thecriticalassessmentofmetagenomeinterpretationcamicompetition.html. Accessed 29 Mar 2016.
Koren S, Phillippy AM. One chromosome, one contig: complete microbial genomes from longread sequencing and assembly. Curr Opin Microbiol. 2015; 23:110–20.
Lam KK, LaButti K, Khalak A, Tse D. Finishersc: a repeataware tool for upgrading de novo assembly using long reads. Bioinformatics. 2015; 31(19):3207–209.
Chin CS, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, Clum A, Copeland A, Huddleston J, Eichler EE, et al. Nonhybrid, finished microbial genome assemblies from longread smrt sequencing data. Nat Methods. 2013; 10(6):563–9.
Gurevich A, Saveliev V, Vyahhi N, Tesler G. Quast: quality assessment tool for genome assemblies. Bioinformatics. 2013; 29(8):1072–1075.
Delcher AL, Phillippy A, Carlton J, Salzberg SL. Fast algorithms for largescale genome alignment and comparison. Nucleic Acids Res. 2002; 30(11):2478–483.
Koren S, Schatz MC, Walenz BP, Martin J, Howard JT, Ganapathy G, Wang Z, Rasko DA, McCombie WR, Jarvis ED, et al. Hybrid error correction and de novo assembly of singlemolecule sequencing reads. Nat Biotechnol. 2012; 30(7):693–700.
Przyborowski J, Wilenski H. Homogeneity of results in testing samples from poisson series: With an application to testing clover seed for dodder. Biometrika. 1940; 31(3/4):313–323. JSTOR.
Myers EW. The fragment assembly string graph. Bioinformatics. 2005; 21(suppl 2):79–85.
Elias I. Settling the intractability of multiple alignment. J Comput Biol. 2006; 13(7):1323–1339.
Documentation of MetaFinisherSC. https://github.com/kakitone/MetaFinisherSC. Accessed 29 Mar 2016.
Hall RJ, Chin CS, Mehrotra S, Juretic N, Wasserscheid J, Dewar K. An interactive workflow for the analysis of contigs from the metagenomic shotgun assembly of SMRT Sequencing data. 2014. http://files.pacb.com/pdf/RHall_ASM2014_InteractiveWorkflow.pdf. Accessed 29 Mar 2016.
PacBio. PacBio Devnet. https://github.com/PacificBiosciences/DevNet/wiki/Human_Microbiome_Project_MockB_Shotgun. Accessed 29 Mar 2016.
Boetzer M, Pirovano W. Sspacelongread: scaffolding bacterial draft genomes using long read sequence information. BMC Bioinforma. 2014; 15(1):1.
English AC, Richards S, Han Y, Wang M, Vee V, Qu J, Qin X, Muzny DM, Reid JG, Worley KC, et al. Mind the gap: upgrading genomes with pacific biosciences rs longread sequencing technology. PloS ONE. 2012; 7(11):47768.
Acknowledgements
We would like to thank Jon Badalamenti (University of Minnesota), Jason Chin (PacBio), James Drake (PacBio), Asif Khalak (Samsung, mHealth), Kurt LaButti (JGI), Julia Oh (Jax.org), Lior Pachter (UC Berkeley), Lorian Schaeffer (UC Berkeley), David Tse (Stanford University), Lizzy Wilbanks (Caltech), Siu Ming Yiu (University of Hong Kong) for discussion. We would also like to thank Jon Badalamenti (University of Minnesota), Julia Oh (Jax.org), Lizzy Wilbanks (Caltech) for test data. This material is based upon the work supported by the National Science Foundation under grant numbers CCF1528174 and CCF1535989.
Availability of data and materials
Information of the software BIGMAC is as follows.

Project name: BIGMAC

Project home page: https://github.com/kakitone/BIGMAC

Archived version: Not applicable

Operating system(s): Linux or Mac OS

Programming language: Python

Other requirements: MUMmer 3.23,

Any restrictions to use by nonacademics: Not applicable
Moreover, the datasets supporting the conclusions of this article are available in the BIGMAC repository (by issuing python reproduce.py), and hyperlinks to datasets are located in

https://berkeley.box.com/shared/static/m3x4czbxbh87vq078li3llr3w0ba3i8n.tar

https://berkeley.box.com/shared/static/pq98sgh3lb6vqvxrsczfrphidf1ik1gj.tar

https://berkeley.box.com/shared/static/h05c4hl6fk8h8k95w0y17jtkeuzsz0kp.tar

https://berkeley.box.com/shared/static/7ub5gqnd61pfvay6knl0ipklmrbzcsjq.tar
Authors’ contributions
KKL, RH, AC and SR contributed in the design of the study and writing of the manuscript. KKL implemented the software. All authors read and approved the final manuscript.
Competing interests
Richard Hall is an employee of Pacific Biosciences at the time the manuscript is written.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Author information
Authors and Affiliations
Corresponding author
Additional file
Additional file 1
Supplementary materials include implementation details, data analysis, theoretical analysis and commands used to run different tools. (PDF 492 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Lam, KK., Hall, R., Clum, A. et al. BIGMAC : breaking inaccurate genomes and merging assembled contigs for long read metagenomic assembly. BMC Bioinformatics 17, 435 (2016). https://doi.org/10.1186/s128590161288y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s128590161288y
Keywords
 Genome assembly
 Next generation sequencing
 Metagenomics