SWAPAssembler: scalable and efficient genome assembly towards thousands of cores
BMC Bioinformatics volume 15, Article number: S2 (2014)
Abstract
Background
There is a widening gap between the throughput of massively parallel sequencing machines and the ability to analyze the resulting data. Traditional assembly methods, which require long execution times and large amounts of memory on a single workstation, are of limited use on such massive data.
Results
This paper presents a highly scalable assembler named SWAPAssembler for processing massive sequencing data using thousands of cores, where SWAP is an acronym for Small World Asynchronous Parallel model. The paper provides a mathematical description of the multistep bidirected graph (MSG) to resolve the computational interdependence of merging edges, and develops a highly scalable computational framework for SWAP that automatically performs the parallel computation of all operations. Graph cleaning and contig extension steps are also included to generate contigs of high quality. Experimental results show that SWAPAssembler scales up to 2048 cores on the Yanhuang dataset, using only 26 minutes, which outperforms several other parallel assemblers such as ABySS, Ray, and PASHA. Results also show that SWAPAssembler generates high-quality contigs with good N50 sizes and low error rates; in particular, it generated the longest N50 contig sizes for the Fish and Yanhuang datasets.
Conclusions
In this paper, we presented a highly scalable and efficient genome assembly software, SWAPAssembler. Compared with several other assemblers, it showed very good performance in terms of scalability and contig quality. This software is available at: https://sourceforge.net/projects/swapassembler
Background
To cope with the massive sequence data generated by next-generation sequencing machines, highly scalable and efficient parallel solutions for fundamental bioinformatics applications are important [1, 2]. With the help of high-performance computing, cloud computing [3, 4], and many-core GPUs [5], scalable solutions have been demonstrated for many embarrassingly parallel applications: sequence alignment [6-8], SNP searching [9, 10], expression analysis [11], etc. However, for tightly coupled graph-related problems such as genome assembly, a scalable solution is still a big challenge [12, 13].
State-of-the-art parallel assemblers include ABySS [14], Ray [15], PASHA [16], and YAGA [17-19]. ABySS adopts the traditional de Bruijn graph data structure proposed by Pevzner et al. [20] and follows an assembly strategy similar to EULER-SR [21] and Velvet [22]; parallelization is achieved by distributing kmers to multiple servers to build a distributed de Bruijn graph, with error removal and graph reduction implemented over MPI communication primitives. Ray extends kmers (or seeds) into contigs with a greedy heuristic that measures the overlap level of reads in both directions. Based on the observation that the time-consuming parts of genome assembly are generating and distributing kmers, and constructing and simplifying the distributed de Bruijn graph, PASHA concentrates its effort on parallelizing these two stages to improve efficiency. However, PASHA allows only a single process for each unanimous path, which limits its degree of parallelism. In their experiments, ABySS and PASHA take about 87 hours and 21 hours, respectively, to assemble the Yoruban male genome with a coverage of 42X.
To avoid merging kmers residing on two different servers, which can produce many small interprocess messages and incur communication latency, YAGA constructs a distributed de Bruijn graph by maintaining edge tuples in a community of servers. Reducible edges belonging to one unanimous path are grouped onto one server using a list ranking algorithm [23], and these unanimous paths are then reduced locally on separate servers. The complexity of YAGA is bounded by O\left(\frac{n}{p}\right) computing time, O\left(\frac{n}{p}\right) communication volume, and O\left({\text{log}}^{2}\left(n\right)\right) communication rounds, where n is the number of nucleotides in all reads and p denotes the number of processors. However, the recursive list ranking algorithm used in YAGA has a memory usage of O\left(\frac{n\text{log}n}{p}\right), which leads to a large memory footprint and low efficiency.
Our previous work [24] tried to avoid the access collision of merging two neighboring edges. In that work, a 1step bidirected graph and a computational model named SWAP were proposed for the edge merging operation. In its experiments, a prototype of the edge merging algorithm using SWAP scaled to 640 cores on both the Yeast and C. elegans datasets. However, this exploratory work focused only on the edge merging operation of genome assembly; other important problems, such as contig extension and complexity analysis, were not addressed.
The scalability of previous assemblers is limited by the computational interdependence of merging kmers/edges in unanimous paths. Sequential assemblers, for example Velvet and SOAPdenovo, process each path sequentially. Parallel assemblers can process several paths in parallel, but kmers/edges sharing one path are merged one by one. SWAPAssembler resolves this interdependence with the MSG: for each path, up to half of its edges can be merged concurrently in each round, and merging multiple edges on the same path can be done in parallel using the SWAP computational framework. In Figure 1, the parallel strategy of SWAPAssembler is compared with those of other assemblers using an example of two linked paths; a deeper parallelism in edge merging is achieved by SWAPAssembler.
In this paper, we present a highly scalable and efficient genome assembler named SWAPAssembler, which scales to thousands of cores on massive sequencing data such as Yanhuang (500 GB). SWAPAssembler includes five fully parallelized steps: input parallelization, graph construction, graph cleaning, graph reduction, and contig extension. Compared with our previous work, two fundamental improvements have been made for graph reduction. First, the MSG is presented as a comprehensive mathematical abstraction of genome assembly; using the MSG and semigroup theory, the computational interdependence of merging edges is resolved. Second, we have developed a scalable computational framework for SWAP, which triggers the parallel computation of all operations without interference. The complexity of this framework and of SWAPAssembler is also analyzed and proved in detail. In addition, two steps of SWAPAssembler improve the quality of the contigs: graph cleaning, which adopts the traditional strategy of removing kmolecules and edges with low frequency, and contig extension, which resolves special edges and some cross vertices using a heuristic method. Experimental results show that SWAPAssembler scales up to 2048 cores on the Yanhuang dataset, using only 26 minutes, which is the fastest among assemblers such as ABySS, Ray, and PASHA. Contig evaluation results confirm that SWAPAssembler achieves good N50 sizes with the lowest error rate for the S. aureus and R. sphaeroides datasets. When processing larger datasets (Fish and Yanhuang) without external error correction tools, SWAPAssembler generates the longest N50 contig sizes, 1309 bp and 1368 bp respectively, for these two datasets.
Methods
In this section, our method for genome assembly towards thousands of cores is presented. We first abstract the genome assembly problem with the MSG. Generating longer sequences (contigs) from shorter sequences corresponds to merging semiextended edges into fullextended edges in the MSG. In addition, the computational interdependence of edge merging is resolved by introducing a semigroup over a closed edge set {E}_{S}\vee 0 in the MSG: the edge set {E}_{S}\vee 0 is proved to be a semigroup with respect to the edge merging operation. By the associativity law of semigroups, the final results are the same as long as all edges have been merged, regardless of the merging order; thus these edge merging operations can be computed in parallel.
In order to maximally exploit the parallelism exposed by the MSG, a scalable SWAP computational framework is developed. As one edge may be accessed by two merging operations in two different processes at the same time, the lock-computing-unlock mechanism introduced in [24] is adopted to avoid conflicts. For problems that can be abstracted as a semigroup, the corresponding operations can be performed in parallel, and the SWAP computational framework achieves linear scale-up for these problems.
Based on the MSG and the SWAP computational framework, SWAPAssembler is developed in five steps: input parallelization, graph construction, graph cleaning, graph reduction, and contig extension. In the following, we first present the MSG and the SWAP computational framework, followed by the details of SWAPAssembler's five steps.
Mathematical formulation of genome assembly using MSG
Given a biological genome sample with reference sequence w ∈ ℕ^{g}, where ℕ = {A, T, C, G} and g = |w|, a large number of short sequences called reads, S = {s_{1}, s_{2}, ..., s_{ h }}, can be generated by sequencing machines. Genome assembly is the process of reconstructing the reference genome sequence from these reads. Unfortunately, the genome assembly problem of finding the shortest string containing all reads as substrings is NP-hard [25].
Finding the original sequence among all possible Euler paths cannot be done in polynomial time [26]. In real cases, gaps and branches caused by uneven coverage, erroneous reads, and repeats prevent obtaining the full-length genome, and a set of shorter genome sequences called contigs is generated by merging unanimous paths instead. Our method focuses on finding a mathematically grounded and highly scalable solution for the following standard genome assembly (SGA) problem, which is illustrated in Figure 2.
Problem of Standard Genome Assembly (SGA)
Input: Given a set of reads without errors S = {s_{1}, s_{2}, ..., s_{ h }}
Output: A set of contigs C = {c_{1}, c_{2}, ..., c_{ w }}
Requirement: Each contig maps to a unanimous path in the de Bruijn graph constructed from the read set S.
Preliminaries
We first define some notation. Let s ∈ ℕ^{l} be a DNA sequence of length l. Any substring of s with length k is called a kmer of s, denoted by α = s[j]s[j + 1] . . . s[j + k − 1], 0 ≤ j < l − k + 1. The set of kmers of a given string s is written as Z(s, k), where k must be odd. The reverse complement of a kmer α, denoted by α′, is obtained by reversing α and complementing each base according to the bijection M : {a → t, t → a, c → g, g → c}. Note that (α′)′ = α.
A kmolecule \hat{\alpha} is a pair of complementary kmers {α, α′}. Let ≻ be the partial ordering relation between strings of equal length, where α ≻ β indicates that α is lexicographically larger than β. We designate the lexicographically larger of two complementary kmers as the positive kmer, denoted α^{+}, and the smaller one as the negative kmer, denoted α^{−}, where α^{+} ≻ α^{−}. We choose the positive kmer α^{+} as the representative of the kmolecule {α, α′}, implying that \hat{\alpha}={\alpha}^{+}=\left\{{\alpha}^{+},{\alpha}^{-}\right\}=\left\{\alpha ,\alpha \prime \right\}. The relationship between kmers and kmolecules is illustrated in Figure 3. The set of all kmolecules of a given string s is known as the kspectrum of s, written as S(s, k). Note that S(s, k) = S(s′, k).
The notation suf(a, l) (respectively pre(a, l)) denotes the length-l suffix (prefix) of string a. The symbol ○ denotes the concatenation of two strings; for example, if s_{1} = "abc" and s_{2} = "def", then s_{1} ○ s_{2} = "abcdef". The number of edges attached to a kmolecule \hat{\alpha} is denoted degree(\hat{\alpha}). All notations are listed in Table 1.
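These preliminaries can be illustrated with a short Python sketch. This is our own illustration, not code from the paper; the function names (`kmers`, `canonical`, `spectrum`) are our own, and lowercase bases are assumed as in the bijection M above.

```python
# Illustrative sketch (not the paper's code): k-mers, reverse complements,
# and the positive k-mer chosen to represent a k-molecule.

COMPLEMENT = str.maketrans("atcg", "tagc")  # the bijection M, lowercase bases

def reverse_complement(kmer: str) -> str:
    """alpha': complement each base via M, then reverse."""
    return kmer.translate(COMPLEMENT)[::-1]

def kmers(s: str, k: int):
    """Z(s, k): all substrings of s with length k."""
    assert k % 2 == 1, "k must be odd so a k-mer never equals its reverse complement"
    return [s[j:j + k] for j in range(len(s) - k + 1)]

def canonical(kmer: str) -> str:
    """Representative of the k-molecule {alpha, alpha'}: the lexicographically larger k-mer."""
    return max(kmer, reverse_complement(kmer))

def spectrum(s: str, k: int):
    """S(s, k): the set of k-molecules of s."""
    return {canonical(a) for a in kmers(s, k)}
```

Note that `spectrum` is invariant under reverse complementation of the whole string, matching S(s, k) = S(s′, k).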
1step Bidirected Graph
Definition 1: 1step bidirected graph. The 1step bidirected de Bruijn graph of order k for a given string s can be presented as,

{G}_{k}^{1}\left(s\right)=\left\{{V}_{s},{E}_{s}^{1}\right\}
(1)

In the rest of the paper, the 1step bidirected de Bruijn graph of order k is abbreviated as the 1step bidirected graph. In equation (1), the vertex set V_{ s } is the kspectrum of s,

{V}_{s}=\mathbb{S}\left(s,k\right)
(2)

and the 1step bidirected edge set {E}_{s}^{1} is defined as follows,
Equation (3) states that any two overlapping kmolecules can be connected with one 1step bidirected edge when they are consecutive in the sequence s or in the complementary sequence s'. Here d_{ α } denotes the direction of kmer α: if α = α^{+}, then d_{ α } = '+', otherwise d_{ α } = '−'. {c}_{\alpha \beta}^{1} is the content or label of the edge and is initialized with β[k − 1], that is, {c}_{\alpha \beta}^{1}=\beta \left[k-1\right], so that suf\left(\alpha \circ {c}_{\alpha \beta}^{1},k\right)=\beta.
Lemma 1. Given two kmolecules \hat{\alpha},\hat{\beta}\in \mathbb{S}\left(s,k\right), there are four possible connections, and for each type of connection exactly two equivalent 1step bidirected edge representations exist,
1. {e}_{\alpha +\beta +}^{1}=\left({\alpha}^{+},{\beta}^{+},+,+,{c}_{\alpha +\beta +}^{1}\right),\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}{e}_{\beta -\alpha -}^{1}=\left({\beta}^{-},{\alpha}^{-},-,-,{c}_{\beta -\alpha -}^{1}\right)

2. {e}_{\alpha +\beta -}^{1}=\left({\alpha}^{+},{\beta}^{-},+,-,{c}_{\alpha +\beta -}^{1}\right),\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}{e}_{\beta +\alpha -}^{1}=\left({\beta}^{+},{\alpha}^{-},+,-,{c}_{\beta +\alpha -}^{1}\right)

3. {e}_{\alpha -\beta +}^{1}=\left({\alpha}^{-},{\beta}^{+},-,+,{c}_{\alpha -\beta +}^{1}\right),\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}{e}_{\beta -\alpha +}^{1}=\left({\beta}^{-},{\alpha}^{+},-,+,{c}_{\beta -\alpha +}^{1}\right)

4. {e}_{\alpha -\beta -}^{1}=\left({\alpha}^{-},{\beta}^{-},-,-,{c}_{\alpha -\beta -}^{1}\right),\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}{e}_{\beta +\alpha +}^{1}=\left({\beta}^{+},{\alpha}^{+},+,+,{c}_{\beta +\alpha +}^{1}\right)
In each type of connection, the two bidirected edge representations are equivalent: the first is associated with kmolecule \hat{\alpha}, and the second with \hat{\beta}. Figure 4 illustrates all four possible connections. For example, in Figure 4(a) the positive kmer "TAG" points to the positive kmer "AGT" with the label "T", and the corresponding edge is {e}_{TAG\phantom{\rule{2.77695pt}{0ex}}AGT}^{1}=\left(TAG,AGT,+,+,T\right).
Given a set of reads S = {s_{1}, s_{2}, . . . , s_{ h }}, the 1step bidirected graph derived from S with order k is,

{G}_{k}^{1}\left(S\right)=\left\{{V}_{S},{E}_{S}^{1}\right\}
(4)

Each read s_{ i } corresponds to a path in {G}_{k}^{1}\left(S\right), and read s_{ i } can be recovered by concatenating the (k − 1)-prefix of the first kmolecule with the edge labels along the path formed by S(s_{ i }, k). As an example, the 1step bidirected de Bruijn graph derived from S = {"TAGTCG", "AGTCGA", "TCGAGG"} is plotted in Figure 5.
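A minimal sketch (our own, not the paper's implementation) of how the 1step bidirected edges of a read could be derived, assuming lowercase bases and the tuple representation (α̂, β̂, d_α, d_β, label) from Lemma 1:

```python
# Sketch (ours): build the 1-step bidirected edge for two consecutive k-mers
# in a read. Per equation (3), the label is beta[k-1].

COMPLEMENT = str.maketrans("atcg", "tagc")

def rc(s: str) -> str:
    """Reverse complement."""
    return s.translate(COMPLEMENT)[::-1]

def edge(alpha: str, beta: str):
    """1-step bidirected edge between the k-molecules of consecutive k-mers."""
    k = len(alpha)
    assert alpha[1:] == beta[:k - 1], "k-mers must overlap by k-1 bases"
    a_hat, b_hat = max(alpha, rc(alpha)), max(beta, rc(beta))
    d_a = '+' if alpha == a_hat else '-'
    d_b = '+' if beta == b_hat else '-'
    return (a_hat, b_hat, d_a, d_b, beta[k - 1])

def edges_of_read(s: str, k: int):
    """All 1-step bidirected edges along the path of read s."""
    return [edge(s[j:j + k], s[j + 1:j + k + 1]) for j in range(len(s) - k)]
```

For instance, `edge("tag", "agt")` reproduces the Figure 4(a) example as `("tag", "agt", '+', '+', 't')`.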
Multistep Bidirected Graph and Its Properties
Definition 2: edge merging operation. Given two 1step bidirected edges {e}_{\alpha \beta}^{1}=\left(\alpha ,\beta ,{d}_{\alpha},{d}_{\beta},{c}_{\alpha \beta}^{1}\right) and {e}_{\beta \gamma}^{1}=\left(\beta ,\gamma ,{d}_{\beta},{d}_{\gamma},{c}_{\beta \gamma}^{1}\right) in a 1step bidirected graph, if {e}_{\alpha \beta}^{1}.{d}_{\beta}={e}_{\beta \gamma}^{1}.{d}_{\beta} and degree(\hat{\beta}) = 2, we can obtain a 2step bidirected edge {e}_{\alpha \gamma}^{2}=\left(\alpha ,\gamma ,{d}_{\alpha},{d}_{\gamma},{c}_{\alpha \gamma}^{2}\right) by merging edges {e}_{\alpha \beta}^{1} and {e}_{\beta \gamma}^{1}, where {c}_{\alpha \gamma}^{2}={c}_{\alpha \beta}^{1}\circ {c}_{\beta \gamma}^{1}. The symbol ⊗ denotes the edge merging operation between two bidirected edges attached to the same kmolecule with the same direction, and the 2step bidirected edge is written as,

{e}_{\alpha \gamma}^{2}={e}_{\alpha \beta}^{1}\otimes {e}_{\beta \gamma}^{1},\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}{e}_{\gamma \alpha}^{2}={e}_{\gamma \beta}^{1}\otimes {e}_{\beta \alpha}^{1}
(5)
The two edges {e}_{\alpha \gamma}^{2} and {e}_{\gamma \alpha}^{2} in equation (5) are equivalent, indicating that applying the edge merging operation on {e}_{\alpha \beta}^{1} and {e}_{\beta \gamma}^{1} is the same as applying it on {e}_{\gamma \beta}^{1} and {e}_{\beta \alpha}^{1}. Figure 6 shows an example of the edge merging operation.
The zero edge 0 is defined to represent all non-existing bidirected edges. Note that 0\otimes {e}_{\alpha \beta}^{x}=0,\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}{e}_{\alpha \beta}^{x}\otimes 0=0. A zstep bidirected edge can be obtained by,

{e}_{\alpha \gamma}^{z}={e}_{\alpha \beta}^{x}\otimes {e}_{\beta \gamma}^{y},\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}z=x+y
(6)
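The merging rule of Definition 2, including the zero edge, can be sketched as follows. The tuple encoding and the `degree_beta` parameter are our own simplifications, not the paper's data structures.

```python
# Sketch of the edge merging operation (Definition 2). An x-step edge is
# represented as (alpha, beta, d_alpha, d_beta, content); merging two edges
# that meet at beta with the same direction concatenates their contents.

ZERO = None  # the zero edge 0: stands for all non-existing bidirected edges

def merge(e1, e2, degree_beta=2):
    """Return e1 ⊗ e2, or ZERO if the edges cannot be merged."""
    if e1 is ZERO or e2 is ZERO:
        return ZERO
    a, b1, da, db1, c1 = e1
    b2, g, db2, dg, c2 = e2
    # the edges must share k-molecule beta with the same direction,
    # and beta must have exactly two attached edges
    if b1 != b2 or db1 != db2 or degree_beta != 2:
        return ZERO
    return (a, g, da, dg, c1 + c2)  # content c = c1 ∘ c2
```

Associativity of this operation (Property 2 below) is what allows the merges along a path to be performed in any order.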
Definition 3: Multistep Bidirected Graph (MSG). An MSG derived from a read set S = {s_{1}, s_{2}, . . . , s_{ h }} is written as,

{G}_{k}\left(S\right)=\left\{{V}_{S},{E}_{S}\right\},\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}{E}_{S}={\displaystyle \bigcup _{j=1}^{g}}{\displaystyle \bigcup _{i=1}^{h}}{E}_{{s}_{i}}^{j}
(7)

where g is the length of the reference sequence w and {E}_{{s}_{i}}^{j}=\left\{{e}_{\alpha \beta}^{j}\mid \forall \hat{\alpha},\hat{\beta}\in \mathbb{S}\left({s}_{i},k\right)\right\}. An MSG is obtained through edge merging operations.
Given an xstep bidirected edge {e}_{\alpha \beta}^{x}=\left(\alpha ,\beta ,{d}_{\alpha},{d}_{\beta},{c}_{\alpha \beta}^{x}\right), if there exists an edge {e}_{\gamma \alpha}^{y} or {e}_{\beta \gamma}^{z} satisfying {e}_{\gamma \alpha}^{y}\otimes {e}_{\alpha \beta}^{x}\ne 0 or {e}_{\alpha \beta}^{x}\otimes {e}_{\beta \gamma}^{z}\ne 0, then we call {e}_{\alpha \beta}^{x} a semiextended edge, and the corresponding kmolecule \hat{\alpha} or \hat{\beta} a semiextended kmolecule. If {e}_{\alpha \beta}^{x} cannot be extended by any edge, it is called a fullextended edge, and the kmolecules \hat{\alpha} and \hat{\beta} are fullextended kmolecules. In Figures 5 and 6, semiextended and fullextended kmolecules are plotted in different colors (yellow for semiextended, blue for fullextended), and semiextended and fullextended edges are drawn with different line styles (dashed lines for semiextended edges, solid lines for fullextended edges).
Property 1. If the set of fullextended edges in the MSG defined in equation (7) is denoted as {E}_{S}^{*}, then the set of labels on all edges in {E}_{S}^{*} can be written as,

{L}_{S}^{*}=\left\{{c}_{\alpha \beta}^{x}\mid {e}_{\alpha \beta}^{x}\in {E}_{S}^{*}\right\}
(8)

and we have {L}_{S}^{*}=C, where C is the set of contigs. The proof is presented in Appendix 1.
Property 2. Edge merging operation ⊗ over the multistep bidirected edge set {E}_{S}\vee 0 is associative, and Q({E}_{S}\vee 0,⊗) is a semigroup. The proof is presented in Appendix 1.
The key property of the 1step bidirected graph {G}_{k}^{1}\left(S\right) is that each read s corresponds to a path starting from the first kmolecule of s and ending at the last one. Similarly, each chromosome can also be regarded as a path. However, because of sequencing gaps, read errors, and repeats in the set of reads, a chromosome is broken into pieces, or contigs. Within an MSG, each contig corresponds to one fullextended edge in {G}_{k}\left(S\right), as presented and proved in Property 1. Property 2 ensures that the edge merging operation ⊗ over the set of multistep bidirected edges forms a semigroup, which connects the standard genome assembly (SGA) problem with edge merging operations in a semigroup. By the associativity law of semigroups, the final fullextended edges, or contigs, will be the same as long as all edges are merged, regardless of the merging order; thus these edge merging operations can be computed in parallel. Finally, in order to reconstruct the genome as a large set of contigs, we need to merge all semiextended edges into fullextended edges in the semigroup Q\left({E}_{S}\vee 0,\otimes \right).
SWAP computational framework
The lock-computing-unlock mechanism of SWAP was first introduced in our previous work [24], but without implementation details or complexity analysis. In this section, we give a mathematical description of the problems that can be solved by SWAP, then present a scalable computational framework for the SWAP model together with its programming interface. Its complexity and scalability are analyzed in Appendix 3.
Definition 4: small world of operations. Semigroup SG(A, R) is defined on set A with an associative operation R : A × A → A. R(a_{ i }, a_{ j }) is used to denote the associative operation on a_{ i } and a_{ j }, a_{ i }, a_{ j } ∈ A. The elements a_{ i } and a_{ j }, together with the operation R(a_{ i }, a_{ j } ) are grouped as a small world of the operation R(a_{ i }, a_{ j }). We denote this small world as [a_{ i }, a_{ j }], and [a_{ i }, a_{ j }] = {R(a_{ i }, a_{ j }), a_{ i }, a_{ j }}.
An activity ACT(A, σ) is defined on a semigroup SG(A, R) as the computational work performed by a graph algorithm, where the operation set σ is a subset of R.
In real applications, an operation corresponds to a basic operation of a given algorithm. For example, in MSG-based genome assembly an operation can be defined as edge merging; in topological sorting, reordering a pair of vertices can be defined as an operation.
For any two small worlds [a_{1}, a_{2}] and [b_{1}, b_{2}], where a_{1} ≠ b_{1}, a_{1} ≠ b_{2}, a_{2} ≠ b_{1}, and a_{2} ≠ b_{2}, the corresponding operations R(a_{1}, a_{2}) and R(b_{1}, b_{2}) can be computed independently; thus, there is potential parallelism in computing the activity induced from the semigroup SG(A, R). We use SWAP for such parallel computing. The basic schedule of SWAP is lock-computing-unlock. For an operation R(a, b) in σ, the three steps of SWAP are listed below:
1. Lock: R(a, b)'s small world [a, b] is locked.

2. Computing: operation R(a, b) is performed, and the values of a and b are updated accordingly. In MSG, this corresponds to merging two edges.

3. Unlock: R(a, b)'s small world [a, b] is released.
In the SWAP computational framework, all operations σ in the activity ACT(A, σ) are distributed among a group of processes. Each process fetches the related elements, for example a and b, to compute the operation R(a, b). At the same time, it also cooperates with other processes to send or update local variables. Each process conceptually has two threads: a SWAP thread, which performs computing tasks using the three-step SWAP schedule, and a service thread, which listens and replies to remote processes. In the implementation of our framework, we avoid multithreading by using non-blocking communication and a finite-state machine in each process to ensure efficiency.
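The lock-computing-unlock schedule can be illustrated with a single-machine sketch that uses Python threads and per-element locks in place of MPI processes and remote lock messages. This is our own simplification for illustration, not the paper's MPI implementation.

```python
# Minimal single-machine simulation of the SWAP lock-computing-unlock
# schedule. The real framework uses non-blocking MPI communication and a
# finite-state machine; here a failed lock simply reports False so the
# caller can retry after a backoff.

import threading

locks = {}  # one lock per element of the set A

def swap_step(a, b, R, state):
    """Try to lock the small world [a, b], compute R(a, b), then unlock."""
    la, lb = locks[a], locks[b]
    if not la.acquire(blocking=False):   # lock
        return False
    if not lb.acquire(blocking=False):
        la.release()
        return False
    try:
        state[a], state[b] = R(state[a], state[b])  # computing
    finally:
        lb.release()                     # unlock
        la.release()
    return True
```

A caller that receives `False` would wait for a randomized backoff delay and retry, as described below.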
The activity ACT(A, σ) on set A with operations in σ can be treated as a graph G(σ, A) with σ as its vertices and A as its edges. An adjacency list is used to store the graph G(σ, A), and a hash function hashFun(x) is used to distribute the set σ into p subsets for p processes, where \sigma ={\displaystyle \bigcup _{i=0}^{p-1}}{\sigma}_{i}, and A_{ i } is associated with σ_{ i }. Note that ACT\left(A,\sigma \right)={\displaystyle \bigcup _{i=0}^{p-1}}AC{T}_{i}\left({A}_{i},{\sigma}_{i}\right), as illustrated in Figure 7. In Appendix 2, the pseudocode of the SWAP thread and the service thread is given in Algorithm 1 and Algorithm 2, respectively.
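A hash-based partition of σ over p processes might be sketched as follows; `hash_fun` is an illustrative stand-in for the paper's hashFun, here simply Python's built-in hash modulo p.

```python
# Sketch of distributing the operation set sigma into p subsets,
# sigma_0 .. sigma_{p-1}, one per process (cf. Figure 7).

def hash_fun(x, p: int) -> int:
    """Illustrative hash: map an operation to one of p processes."""
    return hash(x) % p

def distribute(sigma, p: int):
    """Partition sigma into p subsets via the hash function."""
    subsets = [[] for _ in range(p)]
    for op in sigma:
        subsets[hash_fun(op, p)].append(op)
    return subsets
```

In the distributed setting, each process i would then run the SWAP thread over its own σ_{i} while its service thread answers lock requests from other processes.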
Algorithm 1 describes the three steps of the SWAP computational framework, and Algorithm 2, running on remote processors, cooperates with Algorithm 1 to execute this schedule. The message functions, internal functions, and user-defined functions of Algorithm 1 and Algorithm 2 are listed in Table 2; the user-defined functions can be redefined for user-specific computational problems.
Similar to CSMA/CA in the 802.11 protocol [27], Algorithm 1 adopts a random backoff algorithm to avoid lock collisions. A variety of backoff algorithms can be used; without loss of generality, binary exponential backoff [27] is used in the SWAP thread. Note that all collided operations in σ_{ i } share a single binary exponential backoff, so its cost can be ignored as long as the number of operations in σ_{ i } is large.
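Binary exponential backoff, as used by the SWAP thread after a failed lock attempt, can be sketched as follows; the unit delay and the cap on the exponent are our own illustrative choices, not values from the paper.

```python
# Sketch of binary exponential backoff: after the n-th collision, wait a
# random number of time units drawn uniformly from [0, 2^n - 1].

import random

def backoff_delay(attempt: int, unit: float = 1e-4, cap: int = 10) -> float:
    """Delay (in seconds) before retrying a lock after `attempt` collisions."""
    n = min(attempt, cap)  # cap the exponent to bound the maximum delay
    return random.randint(0, 2 ** n - 1) * unit
```

Randomizing the delay spreads out retries so that two processes repeatedly contending for the same small world are unlikely to collide again.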
The complexity and scalability analysis of the SWAP computational framework is presented in Appendix 3. When the number of processes is less than the number of operations in σ, which is true in most cases, the analysis shows that the SWAP computational framework scales linearly with the increasing number of cores. When the number of processes is larger than the number of operations, the running time is dominated by the number of communication rounds, which is bounded by log(d_{ max }), where d_{ max } is the diameter of the graph G(σ, A).
Implementation of SWAPAssembler
Based on the MSG and the SWAP computational framework, SWAPAssembler consists of five steps: input parallelization, graph construction, graph cleaning, graph reduction, and contig extension. The complexity analysis of SWAPAssembler is presented at the end of this section.
Input parallelization
As the data generated by next-generation sequencing technology generally amount to hundreds of gigabytes, loading them with a single process takes hours [14, 16]. Similar to Ray [15] and YAGA [18], we use input parallelization to speed up the loading process. Given input reads with n nucleotides from a genome of size g, we divide the input file equally into p virtual data blocks, where p is the number of processes. Each process reads the data located in its virtual data block only once. The computational complexity of this step is bounded by O\left(\frac{n}{p}\right). For the E. coli dataset of 4.4 GB, SWAPAssembler loads the data into memory in 4 seconds with 64 cores while YAGA uses 516.5 seconds [18], and for the Yanhuang dataset SWAPAssembler loads the data in 10 minutes while Ray takes 2 hours and 42 minutes.
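The division into virtual data blocks might look like the following sketch (our own illustration; it omits the handling of reads that straddle block boundaries, which a real implementation must address):

```python
# Sketch of input parallelization: process i of p reads only its virtual
# block [start, end) of an input file of `size` bytes.

def virtual_block(size: int, p: int, i: int):
    """Byte range of the i-th of p roughly equal virtual data blocks."""
    base, extra = divmod(size, p)
    start = i * base + min(i, extra)
    end = start + base + (1 if i < extra else 0)
    return start, end
```

Each process seeks to its own `start` offset and reads `end - start` bytes, so the whole file is read exactly once across all processes.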
Graph construction
This step aims to construct a 1step bidirected graph {G}_{k}^{1}\left(S\right)=\left\{{V}_{s},{E}_{s}^{1}\right\}, where V_{ s } and {E}_{s}^{1} are kmolecule set and 1step bidirected edge set. In this step, input sequences are broken into overlapping kmolecules by sliding a window of length k along the input sequence. A kmolecule can have up to eight edges, and each edge corresponds to a possible onebase extension, {A, C, G, T} in either direction. The adjacent kmolecule can be easily obtained by adding the base extension to the source kmolecule. The generated graph has O(n) kmolecules and O(n) bidirected edges distributed among p processors. Graph construction of 1step bidirected graph can be achieved in O\left(\frac{n}{p}\right) parallel computing time, and O\left(\frac{n}{p}\right) parallel communication volume.
An improvement over our previous work [24] is that the time spent on graph construction is overlapped with the previous step. As CPU computation and network communication can start when only part of the data has been loaded in the first step, the two steps are overlapped and combined into a pipeline, so the computation and communication time of this step is hidden behind the disk I/O time of the previous step.
Graph cleaning
This step removes erroneous kmolecules, based on the assumption that erroneous kmers have lower frequency than correct ones [19]. Assuming that errors are random, we identify kmolecules with low frequency as erroneous and delete them from the vertex set. SWAPAssembler also removes all low-frequency edges in the 1step bidirected graph, along with any kmolecules left without attached edges. The frequency threshold can be set by the user, or our method calculates it automatically based on the average coverage of the kmolecules; in our case, we prefer 3~10% of the average coverage as the threshold, depending on the species.
All the operations in this step can be finished in O\left(\frac{n}{p}\right) parallel computation time, and about 60~80% of the kmolecules are removed from the graph.
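A sketch of the cleaning filter (our own; the 5% default fraction is simply a middle value of the 3~10% range mentioned above, and the dict-of-counts representation is an assumption):

```python
# Sketch of graph cleaning: keep only k-molecules whose frequency exceeds
# a threshold, computed automatically from the average coverage when the
# user does not supply one.

def clean(frequency: dict, threshold=None, fraction: float = 0.05):
    """Return the k-molecules whose frequency passes the threshold."""
    if threshold is None:
        avg_coverage = sum(frequency.values()) / len(frequency)
        threshold = max(1, int(fraction * avg_coverage))
    return {km for km, f in frequency.items() if f > threshold}
```

Edges would be filtered in the same way, after which any vertex left without attached edges is dropped.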
Graph reduction
In order to recover contigs, all semiextended edges in the MSG need to be merged into fullextended edges. This task can be defined as an edge merging computing activity, denoted ACT\left({E}_{S}\vee 0,\sigma \right), where the edge merging operation set σ is,
in which {e}_{\beta \alpha}^{u} denotes a ustep bidirected edge connecting the two vertices β and α. All semiextended edges of E_{ S } are eventually merged into fullextended edges.
In order to compute the edge merging operations in σ_{ i } using our SWAP computational framework, the two user-defined functions of Table 2 are described as Algorithm 3 and Algorithm 4 in Appendix 2. For each process, the edge merging step has a computing complexity of O\left(\frac{n}{p}\right), a communication volume of O\left(\frac{n\phantom{\rule{2.77695pt}{0ex}}\text{log}\left(\text{log}\left(g\right)\right)}{p}\right), and a communication round count of O(log(log(g))). The proposed method thus requires far fewer communication rounds, O(log(log(g))), than YAGA's O(log^{2}(n)) [18, 19]. The detailed complexity analysis is provided in Appendix 4.
Contig extension
In order to extend the length of contigs while introducing as few errors as possible, three types of special edges and two types of special vertices are processed in our method.
The first type of special edge is the tip edge, which is connected to a terminal vertex and has a length less than k, where k is the kmer size. Tip edges are deleted from the graph. The second type is the self-loop edge, whose beginning and terminal vertices are the same. If this vertex has another edge that can be merged with the self-loop edge, the two are merged; otherwise the self-loop edge is removed. The last type is the multiple edge, where two vertices are directly connected by two different edges. In this case the edge with lower coverage is removed.
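The three special-edge rules can be sketched as follows; the dict-based edge representation with `'u'`, `'v'`, `'label'`, and `'coverage'` fields is our own, not the paper's data structure.

```python
# Sketch of the special-edge rules in contig extension.

def is_tip(e, degree, k: int) -> bool:
    """Tip: touches a terminal (degree-1) vertex and is shorter than k."""
    terminal = degree[e['u']] == 1 or degree[e['v']] == 1
    return terminal and len(e['label']) < k

def is_self_loop(e) -> bool:
    """Self-loop: begins and ends at the same vertex."""
    return e['u'] == e['v']

def prune_multiple_edges(edges):
    """For vertices joined by parallel edges, keep only the higher-coverage one."""
    best = {}
    for e in edges:
        key = frozenset((e['u'], e['v']))  # unordered vertex pair
        if key not in best or e['coverage'] > best[key]['coverage']:
            best[key] = e
    return list(best.values())
```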
In addition, processing two types of special vertices helps further improve the quality of contigs. The first is the cross vertex shown in Figure 8(e), which has more than two edges on both sides. For each cross vertex, we sort all its edges by coverage. When the coverage difference between two edges is less than 20%, the two edges are merged as long as they are mergeable, regardless of other edges. The second is the virtual cross vertex shown in Figure 8(f). We treat the edge e_{0} together with its two end vertices as one virtual vertex A*, and A* has more than two edges on both sides. All its edges are ranked by coverage; when the coverage difference between two edges on different nodes is less than 20%, these two edges are merged with the edge e_{0}, regardless of other edges. By processing these two types of special vertices with this heuristic, we can partly resolve the repeats satisfying our strict conditions, at the cost of occasionally introducing errors and mismatches into contigs.
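The cross-vertex heuristic might be sketched like this (our own illustration; interpreting the 20% coverage difference as a relative difference, and pairing edges greedily by coverage rank, are both assumptions):

```python
# Sketch of the cross-vertex heuristic: pair the edges on the two sides of
# a cross vertex by coverage rank, merging a pair only when the coverages
# differ by less than 20%.

def pair_cross_edges(left, right, tol: float = 0.2):
    """left/right: lists of (edge_id, coverage); return mergeable pairs."""
    left = sorted(left, key=lambda e: e[1], reverse=True)
    right = sorted(right, key=lambda e: e[1], reverse=True)
    pairs = []
    for (le, lc), (re, rc) in zip(left, right):
        if abs(lc - rc) / max(lc, rc) < tol:  # coverage difference < 20%
            pairs.append((le, re))
    return pairs
```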
The graph reduction and contig extension steps are iterated a constant number of rounds to extend fullextended edges, stopping when no special edges or vertices can be found. The number of errors and mismatches introduced into contigs can be controlled by the percentage of special edges and vertices processed in the contig extension stage; in our method, we process all of them, aiming for longer contigs. The computing complexity of the contig extension step is bounded by O\left(\frac{n}{p}\right). As the graph shrinks greatly after graph reduction and contig extension, all remaining edges are treated as contigs. The complexity of SWAPAssembler is dominated by the graph reduction step, which is bounded by O\left(\frac{n}{p}\right) parallel computing time, O\left(\frac{n\cdot \text{log}\left(\text{log}\left(g\right)\right)}{p}\right) communication volume, and O(log(log(g))) communication rounds. According to the complexity analysis of the graph reduction step in Appendix 4, when the number of processors is less than the length of the longest path d_{ max }, the speedup of SWAPAssembler can be calculated as follows,

Speedup\left(p\right)=O\left(\frac{p}{\text{log}\left(\text{log}\left(g\right)\right)}\right)
(10)
Equation (10) indicates that, for a given genome of fixed length g, the speedup is proportional to the number of processors, while for a given number of processors p, the speedup is inversely proportional to the logarithm of the logarithm of the genome size g. However, when the number of processors is larger than the length of the longest path d_{ max }, the running time is bounded by the number of communication rounds, as analyzed in Appendix 4. In either situation, we can conclude that the scalability, or the optimal number of cores, increases with larger genomes.
Results
SWAPAssembler is a highly scalable and efficient genome assembler built on the multistep bidirected graph (MSG). In this section, we perform several computational experiments to evaluate its scalability and assembly quality. TianHe-1A [28] is used as the high performance cluster, with 512 computing nodes allocated for the experiments, each node providing 12 cores and 24 GB of memory. By comparing with several state-of-the-art sequential and parallel assemblers, namely Velvet [22], SOAPdenovo [29], Pasha [16], ABySS [14] and Ray [15], we evaluate the scalability of SWAPAssembler and the quality of its contigs in terms of N50 size, error rate, and coverage.
Experimental data
The five datasets in Table 3 are selected for the experiments. The S. aureus, R. sphaeroides and human chromosome 14 (Hg14) datasets are taken from the GAGE project [30], the Fish dataset is downloaded from Assemblathon 2 [31, 32], and the Yanhuang dataset [33] is provided by BGI [34].
Scalability evaluation
The scalability of our method is first evaluated on a shared-memory machine with 32 cores and 1 TB of memory. Five other assemblers, Velvet, SOAPdenovo, Pasha, ABySS and Ray, are included for comparison. Only the first three (small) datasets in Table 3 are used in this test because of the memory limitation. The results are presented in Table 4, and the corresponding curves are plotted in Figure 9. According to Table 4, SWAPAssembler has the lowest running time on all three datasets with 16 and 32 cores, while SOAPdenovo has the lowest time usage with 4 and 8 cores. According to Figure 9, SOAPdenovo, Ray and SWAPAssembler scale to 32 cores, whereas Pasha and ABySS scale only to 16 cores. Figure 9 also shows that Ray and SWAPAssembler achieve nearly linear speedup, with SWAPAssembler being the more efficient of the two.
The time usage of each step of SWAPAssembler on the shared-memory machine is presented in Table 5 and Figure 10. The input parallelization step is overlapped with graph construction, so we treat these two steps as one in this experiment. According to Table 5, the most time-consuming step on all three datasets is graph reduction, and the fastest steps are graph cleaning and contig extension. Figure 10 shows that input parallelization & graph construction, graph cleaning, and graph reduction achieve nearly linear speedup as the number of cores increases from 4 to 32, whereas the contig extension step does not benefit as much as the other steps.
To further evaluate the scalability of SWAPAssembler from 64 to 4096 cores on TianHe-1A, we compared our method with two parallel assemblers, ABySS and Ray; the results are given in Table 6. According to Table 6, SWAPAssembler is 119 times faster than ABySS and 73 times faster than Ray with 1024 cores on the S. aureus dataset. On the same dataset, ABySS and Ray gain no speedup beyond 64 and 128 cores, respectively, whereas SWAPAssembler scales up to 512 cores. On the R. sphaeroides dataset, ABySS, Ray, and SWAPAssembler scale up to 128, 256, and 1024 cores, respectively.
For the three larger datasets, Table 6 shows that the scalability of SWAPAssembler is also better than that of the other two methods. On the Hg14 dataset, SWAPAssembler is 280 times faster than ABySS and 38 times faster than Ray when using 1024 cores. As on the R. sphaeroides dataset, the three assemblers reach their scalability turning points at 128, 256 and 1024 cores, respectively. The Fish and Yanhuang datasets could not be assembled by ABySS or Ray within 12 hours, so their running times are not recorded in Table 6. With 1024 cores, SWAPAssembler assembles the Fish dataset in 16 minutes, and with 2048 cores it assembles the Yanhuang dataset in 26 minutes. The speedup curves of SWAPAssembler on the five datasets are shown in Figure 11. The speedup on the two small datasets has a turning point at 512 cores, while linear speedup up to 1024 cores is achieved on the three larger datasets. SWAPAssembler still benefits from additional cores up to 2048 when processing the Yanhuang dataset.
Memory footprint is a bottleneck when assembling large genomes, and parallel assemblers address it by aggregating the memory of many computational nodes. In our case, the Fish and Yanhuang assemblies require about 1.6 TB and 1.8 TB of memory, respectively. Since each TianHe-1A node has 24 GB of memory, the Fish genome cannot be assembled on a cluster of 64 nodes, and the same reasoning applies to the Yanhuang dataset.
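This node-count argument is simple arithmetic; the sketch below assumes binary units (1 TB = 1024 GB) and the 24 GB per node stated for TianHe-1A.

```python
import math

def min_nodes(required_tb, mem_per_node_gb=24.0):
    """Minimum number of nodes whose aggregate memory holds the
    assembly graph (assuming 1 TB = 1024 GB)."""
    return math.ceil(required_tb * 1024 / mem_per_node_gb)

# A 64-node partition offers 64 * 24 GB = 1.5 TB, short of the
# ~1.6 TB (Fish) and ~1.8 TB (Yanhuang) footprints.
```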
SWAPAssembler has better scalability than Ray and ABySS because of two key improvements. First, MSG resolves the computational interdependence of edge merging operations along a single unanimous path. Second, the SWAP computational framework triggers parallel computation of all operations without interference, and communication latency is hidden by improving computing throughput. Ray and ABySS cannot merge the k-mers of a single linear chain in parallel, and PASHA can only parallelize k-mer merging across different chains, which limits their degree of parallelism.
Assembly quality assessment
This section evaluates the assembly quality of SWAPAssembler. To be compatible with the comparison results from GAGE, we follow the GAGE error correction method: ALLPATHS-LG [35] and Quake [36] are used to correct errors in the S. aureus and R. sphaeroides datasets, and the corrected reads are used as input to ABySS, Ray and SWAPAssembler. In addition, two sequential genome assemblers, Velvet and SOAPdenovo, are included in this experiment for comparison, run on a machine with 1 TB of memory. The k-mer size for all assemblers varies between 19 and 31, and the best assembly result over these k-mer sizes for each assembler is reported in Table 7 and Table 8.
Table 7 presents four evaluation metrics: the number of contigs, N50 size, the number of erroneous contigs, and error-corrected N50. Error-corrected N50 excludes the misleading assessment of inflated N50 by correcting erroneously concatenated contigs: each erroneous contig is broken at every misjoin and at every indel longer than 5 bases. From Table 7, SWAPAssembler generates 3 and 7 erroneous contigs on the S. aureus and R. sphaeroides datasets, respectively, the fewest among all assemblers. Its N50 size and error-corrected N50 size are also larger than those of the other two parallel assemblers, Ray and ABySS. SOAPdenovo has the fewest contigs and the largest N50 size on both datasets.
Table 8 summarizes the statistics of the contigs generated for these two datasets, using three metrics from [30]. An "assembly size" close to the genome size is better. A larger "chaff size" can indicate that the assembler is unable to integrate short repeat structures into larger contigs or does not properly correct erroneous bases. Coverage is measured as the percentage of reference bases aligned to any assembled contig, i.e. "100% - unaligned ref bases" [30]. SWAPAssembler and Ray both have the smallest chaff size of 0.13% on the R. sphaeroides dataset, with very close coverage and assembly sizes. On the S. aureus dataset, Ray has a lower chaff size of 0.10%, but SWAPAssembler achieves a better assembly size of 99.3% and larger coverage of 99.2%.
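A sketch of these three metrics follows; the 200 bp chaff threshold and the exact formulae are assumptions based on GAGE conventions, not the script actually used.

```python
def assembly_metrics(contig_lengths, genome_size, unaligned_ref_pct,
                     min_contig=200):
    """GAGE-style contig statistics: assembly size relative to the
    genome, chaff (contigs below min_contig bases) as a share of the
    assembly, and coverage as 100% minus unaligned reference bases."""
    total = sum(contig_lengths)
    chaff = sum(l for l in contig_lengths if l < min_contig)
    return {
        "assembly_size_pct": 100.0 * total / genome_size,
        "chaff_pct": 100.0 * chaff / total,
        "coverage_pct": 100.0 - unaligned_ref_pct,
    }
```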
We also analyzed the contig statistics on the three larger datasets; the results are presented in Table 9. Because these datasets lack a standard reference set, which the original GAGE script requires, we wrote a script that evaluates the number of contigs, N50 size, maximum contig length, and total bases in contigs. The original reads of the three datasets are processed directly by the five assemblers with a fixed k-mer size of 31. According to Table 9, the N50 size of the contigs generated by SWAPAssembler is the largest on all three datasets. On the Fish and Yanhuang datasets, SWAPAssembler also performs best in the number of contigs and the maximum contig length. On the Hg14 dataset, whose reads are extracted from the human dataset by mapping against human chromosome 14, SWAPAssembler ranks second, third, and second in the number of contigs, maximum contig length, and bases in contigs, respectively. SWAPAssembler achieves the best N50 size on all datasets because its graph cleaning and contig extension steps handle sequencing errors efficiently, whereas the four other assemblers, without help from external error correction tools, are affected by the quality of the input reads on larger datasets.
In conclusion, SWAPAssembler is a highly scalable and efficient genome assembler. Our evaluation shows that it scales up to 2048 cores, much better than other parallel assemblers, and that the contigs it generates are the best in terms of error rate on the small datasets and N50 size on the two larger datasets.
Conclusion
In this paper we presented SWAPAssembler, a fast and efficient genome assembler scaling up to thousands of cores. Two fundamental improvements are crucial for its scalability. First, MSG is presented as a comprehensive mathematical abstraction for genome assembly, with which the computational interdependence of edge merging is resolved. Second, the SWAP computational framework triggers the parallel computation of all operations without interference. Two additional steps are included to improve the quality of contigs: graph cleaning, which adopts the traditional methods of removing k-molecules and edges with low frequency, and contig extension, which resolves special edges and some cross vertices with a heuristic method. Results show that SWAPAssembler scales up to 2048 cores on the Yanhuang dataset, finishing in only 26 minutes, the best result among parallel assemblers such as ABySS and Ray. Contig evaluation confirms that SWAPAssembler generates good contig N50 sizes while retaining a low error rate. When processing massive datasets without external error correction tools, SWAPAssembler is robust to low data quality and generated the longest N50 contig size.
For terabyte-scale genome and metagenome data, for example human gut microbial community sequencing data, highly scalable and efficient assemblers are essential for analysis. Our future work will extend the algorithm with additional modules for massive metagenomic datasets.
The program can be downloaded from https://sourceforge.net/projects/swapassembler.
Additional file 1
Appendix. Appendix 1: property proof of MSG; Appendix 2: algorithms for SWAPAssembler; Appendix 3: complexity analysis of the SWAP computational framework; Appendix 4: complexity analysis of graph reduction.
References
 1.
Schatz MC, Langmead B, Salzberg SL: Cloud computing and the DNA data race. Nature Biotechnology. 2010, 28 (7): 691. 10.1038/nbt0710-691.
 2.
Stein LD: The case for cloud computing in genome informatics. Genome Biol. 2010, 11 (5): 207. 10.1186/gb-2010-11-5-207.
 3.
Armbrust M, Fox A, Griffith R, Joseph AD, Katz R, Konwinski A, Lee G, Patterson D, Rabkin A, Stoica I: A view of cloud computing. Communications of the ACM. 2010, 53 (4): 50-58. 10.1145/1721654.1721672.
 4.
Dean J, Ghemawat S: MapReduce: simplified data processing on large clusters. Communications of the ACM. 2008, 51 (1): 107-113. 10.1145/1327452.1327492.
 5.
Owens JD, Houston M, Luebke D, Green S, Stone JE, Phillips JC: GPU computing. Proceedings of the IEEE. 2008, 96 (5): 879-899. 10.1109/JPROC.2008.917757.
 6.
Matsunaga A, Tsugawa M, Fortes J: CloudBLAST: combining MapReduce and virtualization on distributed resources for bioinformatics applications. eScience 2008 (eScience'08), IEEE Fourth International Conference On. 2008, IEEE, 222-229. 10.1109/eScience.2008.62.
 7.
Wall DP, Kudtarkar P, Fusaro VA, Pivovarov R, Patil P, Tonellato PJ: Cloud computing for comparative genomics. BMC Bioinformatics. 2010, 11 (1): 259. 10.1186/1471-2105-11-259.
 8.
Liu CM, Wong T, Wu E, Luo R, Yiu SM, Li Y, Wang B, Yu C, Chu X, Zhao K: SOAP3: ultra-fast GPU-based parallel alignment tool for short reads. Bioinformatics. 2012, 28 (6): 878-879. 10.1093/bioinformatics/bts061.
 9.
Schatz MC: CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics. 2009, 25 (11): 1363-1369. 10.1093/bioinformatics/btp236.
 10.
Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL: Searching for SNPs with cloud computing. Genome Biol. 2009, 10 (11): R134. 10.1186/gb-2009-10-11-r134.
 11.
Langmead B, Hansen KD, Leek JT: Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol. 2010, 11 (8): R83. 10.1186/gb-2010-11-8-r83.
 12.
McPherson JD: Next-generation gap. Nature Methods. 2009, 6: S2-S5.
 13.
Shendure J, Ji H: Next-generation DNA sequencing. Nature Biotechnology. 2008, 26 (10): 1135-1145. 10.1038/nbt1486.
 14.
Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I: ABySS: a parallel assembler for short read sequence data. Genome Research. 2009, 19 (6): 1117-1123. 10.1101/gr.089532.108.
 15.
Boisvert S, Laviolette F, Corbeil J: Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. Journal of Computational Biology. 2010, 17 (11): 1519-1533. 10.1089/cmb.2009.0238.
 16.
Liu Y, Schmidt B, Maskell DL: Parallelized short read assembly of large genomes using de Bruijn graphs. BMC Bioinformatics. 2011, 12 (1): 354. 10.1186/1471-2105-12-354.
 17.
Jackson BG, Aluru S: Parallel construction of bidirected string graphs for genome assembly. Parallel Processing 2008 (ICPP'08), 37th International Conference On. 2008, IEEE, 346-353.
 18.
Jackson BG, Schnable PS, Aluru S: Parallel short sequence assembly of transcriptomes. BMC Bioinformatics. 2009, 10 (Suppl 1): S14. 10.1186/1471-2105-10-S1-S14.
 19.
Jackson BG, Regennitter M, Yang X, Schnable PS, Aluru S: Parallel de novo assembly of large genomes from high-throughput short reads. Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium On. 2010, IEEE, 1-10.
 20.
Pevzner PA, Tang H, Waterman MS: An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences. 2001, 98 (17): 9748-9753. 10.1073/pnas.171285098.
 21.
Chaisson MJ, Pevzner PA: Short read fragment assembly of bacterial genomes. Genome Research. 2008, 18 (2): 324-330. 10.1101/gr.7088808.
 22.
Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research. 2008, 18 (5): 821-829. 10.1101/gr.074492.107.
 23.
Dehne F, Song SW: Randomized parallel list ranking for distributed memory multiprocessors. International Journal of Parallel Programming. 1997, 25 (1): 1-16. 10.1007/BF02700044.
 24.
Meng J, Yuan J, Cheng J, Wei Y, Feng S: Small world asynchronous parallel model for genome assembly. Network and Parallel Computing. 2012, Springer, Gwangju, 145-155.
 25.
Pop M, Salzberg SL, Shumway M: Genome sequence assembly: algorithms and issues. Computer. 2002, 35 (7): 47-54. 10.1109/MC.2002.1016901.
 26.
Kapun E, Tsarev F: De Bruijn superwalk with multiplicities problem is NP-hard. BMC Bioinformatics. 2013, 14 (Suppl 5): S7.
 27.
Tanenbaum AS: Computer Networks. 2003, Prentice Hall, The Netherlands
 28.
Yang XJ, Liao XK, Lu K, Hu QF, Song JQ, Su JS: The TianHe-1A supercomputer: its hardware and software. Journal of Computer Science and Technology. 2011, 26 (3): 344-351. 10.1007/s02011-011-1137-8.
 29.
Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K: De novo assembly of human genomes with massively parallel short read sequencing. Genome Research. 2010, 20 (2): 265-272. 10.1101/gr.097261.109.
 30.
Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M: GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Research. 2012, 22 (3): 557-567. 10.1101/gr.131383.111.
 31.
Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I, Boisvert S, Chapman JA, Chapuis G, Chikhi R: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. arXiv preprint arXiv:1301.5406. 2013
 32.
Assemblathon 2. [http://assemblathon.org/assemblathon2]
 33.
Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012, 1 (1): 18. 10.1186/2047-217X-1-18.
 34.
Huang Yan: The First Asian Diploid Genome. [http://yh.genomics.org.cn]
 35.
Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences. 2011, 108 (4): 1513-1518. 10.1073/pnas.1017351108.
 36.
Kelley DR, Schatz MC, Salzberg SL: Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 2010, 11 (11): R116. 10.1186/gb-2010-11-11-r116.
Acknowledgements
The authors would like to thank Francis Y. L. Chin, S. M. Yiu, C. M. Leung, Yu Peng and Yi Wang from the University of Hong Kong for their suggestions and modifications, as well as the anonymous reviewers invited by RECOMB-Seq 2014. The calculations in this work were performed on TianHe-1A at the National Supercomputer Center in Tianjin.
Declarations
Funding for the publication of this article was provided by National Science Foundation of China under grant No. 11204342, the Science Technology and Innovation Committee of Shenzhen Municipality under grant No. JCYJ20120615140912201, and Shenzhen Peacock Plan under grant No. KQCX20130628112914299.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 9, 2014: Proceedings of the Fourth Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBSeq 2014). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S9.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
JT carried out the parallel genome assembly studies, participated in the development of SWAPAssembler and drafted the manuscript. BQ participated in the design and optimization of SWAPAssembler. YJ participated in the development of SWAPAssembler and modification of this manuscript. SZ participated in the design of the study and design of the performance test. Pavan conceived of the study, and participated in its design and coordination and helped to draft the manuscript. All authors read and approved the final manuscript.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Meng, J., Wang, B., Wei, Y. et al. SWAPAssembler: scalable and efficient genome assembly towards thousands of cores. BMC Bioinformatics 15, S2 (2014). https://doi.org/10.1186/1471210515S9S2
DOI: https://doi.org/10.1186/1471210515S9S2
Keywords
 genome assembly
 parallel computing
 De Bruijn graph