SWAP-Assembler: scalable and efficient genome assembly towards thousands of cores

Background There is a widening gap between the throughput of massive parallel sequencing machines and the ability to analyze these sequencing data. Traditional assembly methods requiring long execution time and large amount of memory on a single workstation limit their use on these massive data. Results This paper presents a highly scalable assembler named as SWAP-Assembler for processing massive sequencing data using thousands of cores, where SWAP is an acronym for Small World Asynchronous Parallel model. In the paper, a mathematical description of multi-step bi-directed graph (MSG) is provided to resolve the computational interdependence on merging edges, and a highly scalable computational framework for SWAP is developed to automatically preform the parallel computation of all operations. Graph cleaning and contig extension are also included for generating contigs with high quality. Experimental results show that SWAP-Assembler scales up to 2048 cores on Yanhuang dataset using only 26 minutes, which is better than several other parallel assemblers, such as ABySS, Ray, and PASHA. Results also show that SWAP-Assembler can generate high quality contigs with good N50 size and low error rate, especially it generated the longest N50 contig sizes for Fish and Yanhuang datasets. Conclusions In this paper, we presented a highly scalable and efficient genome assembly software, SWAP-Assembler. Compared with several other assemblers, it showed very good performance in terms of scalability and contig quality. This software is available at: https://sourceforge.net/projects/swapassembler


Background
To cope with massive sequence data generated by nextgeneration sequencing machines, a highly scalable and efficient parallel solution for fundamental bioinformatic applications is important [1,2]. With the help of high performance computing, cloud computing [3,4], and many-cores in GPU [5], successful scalable examples have been seen in many embarrassingly parallel applications: sequence alignment [6][7][8], SNP searching [9,10], expression analysis [11], etc. However, for tightly coupled graph related problems, such as genome assembly, a scalable solution is a still a big challenge [12,13].
ABySS adopts the traditional de Bruijn graph data structure proposed by Pevzner et. al. [20] and follows the similar assembly strategy as EULER SR [21] and Velvet [22]. The parallelization is achieved by distributing k-mers to multi-servers to build a distributed de Bruijn graph, and error removal and graph reduction are implemented over MPI communication primitives. Ray extends k-mers (or seeds) into contigs with a heuristical greedy strategy by measuring the overlapping level of reads in both direction. Based on the observation that the time consuming part of genome assembly are generating and distributing k-mers, constructing and simplifying the distributed de Bruijn graph, PASHA concentrates its effort on parallelizing these two stages to improve its efficiency. However, PASHA allows only single process for each unanimous path, and this limits its degree of parallelism. In their experiments, ABySS and PASHA take about 87 hours and 21 hours to assembly the Yoruban male genome with a coverage of 42X.
To avoid merging k-mers on two different servers, which can result in too many small inter-process messages and the communication latency, YAGA constructs a distributed de Bruijn graph by maintaining edge tuples in a community of servers. Reducible edges belonging to one unanimous path are grouped into one server using a list rank algorithm [23], then these unanimous paths are reduced locally on separated servers. The complexity of YAGA is bounded by O( n p ) computing time, O( n p ) communication volume, and O(log 2 (n)) communication rounds, where n is the number of nucleotides in all reads, and p denotes the number of processors. Due to the fact that the recursive list ranking algorithm used in YAGA has a memory usage of O( n log n p ), this will use large amount of memory and cause low efficiency. Our previous work [24] tries to avoid access collision of merging two neighbor edges. In this work, 1-step bidirected graph and a computational model named as SWAP are proposed for edge merging operation. In its experiments, the prototype of edge merging algorithm using SWAP can scale to 640 cores on both Yeast and C.elegans dataset. However this exploratory work only focuses on the edge merging operation of genome assembly, some other important problems are not addressed, for example, contig extension, complexity analysis etc.
The scalability of previous assemblers is affected by the computational interdependence on merging k-mers/edges in unanimous paths. Sequential assemblers, for example Velvet and SOAPdenovo, process each path sequentially. Parallel assemblers can process several paths in parallel, however k-mers/edges sharing one path are merged one by one. SWAP-Assembler resolves the computational interdependence on merging edges sharing the same path with MSG. For each path, at most half of its edges can be merged concurrently in each round, and merging multiple edges on the same path can be done in parallel using SWAP computational framework. In Figure 1, the parallel strategy of SWAP-Assembler is compared with other assemblers using an example of two linked paths, we can see that a deeper parallelism on edge merging can be achieved by SWAP-Assembler.
In this paper, we present a highly scalable and efficient genome assembler named as SWAP-Assembler, which can scale to thousands of cores on processing massive sequencing data such as Yanhuang (500G). SWAP-Assembler includes five fully parallelized steps: input parallelization, graph construction, graph cleaning, graph reduction and contig extension. Compared with our previous work, two fundamental improvements have been made for graph reduction. Firstly MSG is presented as a comprehensive mathematical abstraction for genome assembly. Using MSG and semi-group theory, the computational interdependence on merging edges is resolved. Secondly, we have developed a scalable computational framework for SWAP, this framework triggers the parallel computation of all operations with no interference. In this paper, complexity of this framework and SWAP-Assembler is also analyzed and proved in detail. In addition, two steps in SWAP-Assembler are used to improve the quality of contigs. One is graph cleaning, which adopts the traditional strategy of removing k-molecules and edges with low frequency, and the other one is contig extension, which resolves special edges and some cross nodes using a heuristic method. Experimental results show that SWAP-Assembler can scale up to 2048 cores on Yanhuang dataset using only 26 minutes, which is the fastest compared with other assemblers, such as ABySS, Ray and PASHA. Conitg evaluation results confirm that SWAP-Assembler generates good results on N50 size with lowest error rate for S. aureus and R. sphaeroides datasets. When processing larger datasets (Fish and Yanhuang) without using external error correction tools, SWAP-Assembler generates the longest N50 contig sizes of 1309 bp and 1368 bp for these two datasets.

Methods
In this section, our method for genome assembly towards thousands of cores is presented. We first abstract the genome assembly problem with MSG. Generating longer sequences (contigs) from shorter sequences corresponds to merging semi-extended edges to full-extended edges in MSG. In addition, computational interdependence of edge merging is resolved by introducing a semi-group over a closed edge set Es V 0 in MSG. The edge set Es V 0 is proved to be a semi-group with respect to edge merging operation. According to the associativity law of semigroup, the final results will be the same as long as all edges have been merged regardless of the merging order, thus these edge merging operations can be computed in parallel.
In order to maximally utilize the potential parallelism resolved by MSG, a scalable SWAP computational framework is developed. As one edge may be accessed by two merging operations in two different processes at the same time, a lock-computing-unlock mechanism introduced in [24] is adopted for avoiding the conflict. For the problems which can be abstracted with semi-group, the corresponding operations can be done in parallel, and SWAP computational framework can achieve linearly scale up for these problems.
Based on MSG and SWAP computational framework, SWAP-Assembler is developed with five steps, including input parallelization, graph construction, graph cleaning, graph reduction, and contig extension. In the following, we first present MSG and the SWAP computational framework, then details of SWAP-Assembler's five steps will be followed.

Mathematical formulation of genome assembly using MSG
Given a biological genome sample with reference sequence w ∈ N g , where N = {A, T, C, G}, g = |w|, a large number of short sequences called reads, S = {s 1 , s 2 , ..., s h }, can be generated from the sequencing machines. Genome assembly is the process of reconstructing the reference genome sequence from these reads. Unfortunately, the genome assembly problem of finding the shortest string with all reads as its substring falls into a NP-hard problem [25].
Finding the original sequence from all possible Euler paths cannot be solved in polynomial time [26]. In real cases, gaps and branches caused by uneven coverage, erroneous reads and repeats prevent obtaining full length genome, and a set of shorter genome sequences called contigs are generated by merging unanimous paths instead. Our method focuses on finding a mathematical and highly scalable solution for the following standard genome assembly (SGA) problem, which is also illustrated in Figure 2.
Problem of Standard Genome Assembly (SGA) Input: Given a set of reads without errors S = {s 1 , s 2 , ..., s h } Output: A set of contigs C = {c 1 , c 2 , ..., c w } Requirement: Each contig maps to an unanimous path in the De Bruijn graph constructed from the set of reads S.

Preliminaries
We first define some variables. Let s ∈ N l be a DNA sequence of length l. Any substring derived from s with Figure 1 Parallel strategy comparison on k-mers/edges merging for different assemblers. Two linked paths with 3 nodes and 5 nodes are given as an example, merging a linked path with 3 nodes needs 2 operations/rounds, and merging a path with 5 nodes needs 4 operations/rounds. To assemble these two paths, sequential assemblers need 6 operations, and parallel assemblers need 4 rounds. For SWAP-Assembler different processes can merge several edges on the same path in parallel using the SWAP computational framework, and merging of these 2 paths can be finished in 2 rounds. For a given sequencing data, if we treat the sequencing coverage as an constant number, the upper bound of the three assemby strategies on merging k-mers/edges are bounded by O(g), O(log(g)), and O(log(log(g))) respecitively, here g denotes the genome size and the longest path for a genome of length g will be bounded by O(log(g)). The upper bound of edge merging operations in SWAP-Assembler and expected length of longest path are proved in Appedix 3. Reference of the k-mers merging strategy for these assemblers can be found in their papers or codes. For velvet 1.1.04, the k-mers merging method can be found in its code "./src/concatenatedGraph.c"; for SOAPdenovo-V1.05, its method is in the code "./src/31mer/contig.c"; for ABySS 1.3.5, the method can be found in "./Parallel/NetworkSequenceCollection.cpp"; for YAGA, the method has been descripted in the last paragraph in the methods section [18]; for Pasha, its method is presented in the last paragraph at the graph simplification subsection [16]. length k, is called a k-mer of s, and it is denoted by a = s[j]s[j + 1] . . . s[j + k − 1], 0 ≤ j < l − k + 1. The set of k-mers of a given string s can be written as Z(s, k), where k must be odd. The reverse complement of a k-mer a, denoted by α , is obtained by reversing a and complementing each base by the following bijection of M, M : {a t, t a, c g, g c}. Note that α = α . A k-molecule α is a pair of complementary k-mers {a, a'}. Let . > be the partial ordering relation between the strings of equal length, and α. > β indicates that a is lexicographically larger than b. We designate the lexicographically larger one of two complementary k-mers as the positive k-mer, denoted as a + , and the smaller one as the negative k-mer, denoted as a−, where α + . > α − . We choose the positive k-mer a + as representative k-mer for k-molecule {a, a'}, denoted as a + , implying that α = α + = {α + , α − } = {α, α } . The relationship between kmer and k-molecule is illustrated in Figure 3. The set of all k-molecules of a given string s is known as k-spectrum of s, and it can be written as S(s, k). Noted that S(s, k) = S(s', k).
Notation suf (a, l) (pre(a, l), respectively) is used to denote the length l suffix (prefix) of string a. The symbol ○ is introduced to denote the concatenation operation between two strings. For example, if s 1 = "abc", s 2 = "def ", then s 1°s2 = "abcdef ". The number of edges attached to k-molecule aˆis denoted as degree ( α ). All notations are listed in Table 1.

1-step Bi-directed Graph
Definition 1: 1-step bi-directed graph. The 1-step bidirected de Bruijn graph of order k for a given string s can be presented as, In the rest of the paper, 1-step bi-directed de Bruijn graph of order k is abbreviated as 1-step bi-directed graph. In equation (1), the vertex set V s is the k-spectrum of s, (2) Figure 2 The process of genome assembly and the standard genome assembly (SGA) problem. and the 1-step bi-directed edge set E 1 s is defined as follows, Equations (3) declares that any two overlapped k-molecules can be connected with one 1-step bi-directed edge when they are consecutive in sequence s or the complementary sequence s'. Here d a denotes the direction of k-mer a, if a = a + , d a ='+', otherwise d a = '-'. c 1 αβ is the content or label of the edge, and is initialized with Given two k-molecules α, β ∈ S(s, k), there are four possible connections, and for each type of connection exactly two equivalent 1-step bi-directed edge representations exist, In each type of connection, the first bi-directed edge representation and the second one are equivalent. The first bi-directed edge is associated with k-molecule α , and the second one is associated with β . Figure 4 illustrates all four possible connections. For example in Figure 4-(a), a positive k-mer "TAG" points to positive k-mer "AGT" with a label "A", and the corresponding edge is e 1 T AG AGT = (TAG, AGT, +, +, T) . Given a set of reads S = {s 1 , s 2 , . . . , s h }, a 1-step bidirected graph derived from S with order k is, representative k-mer a + a + = "T AG" Each read s i corresponds to a path in G 1 k (S), and read s i can be recovered by concatenating (k − 1)-prefix of the first k-molecule and the edge labels on the path consisted by S(s i , k). As an example, an 1-step bi-directed de Bruijn graph derived from S = {"T AGT CG", "AGT CGA", "T CGAGG"} is plotted in Figure 5.

Multi-step Bi-directed Graph and Its Properties
Definition 2: edge merging operation. Given two 1-step bi-directed edges by merging edges e 1 αβ and e 1 βγ , c 2 αγ = c 1 αβ • c 1 βγ . Using symbol ⊗ to denote edge merging operation between two bidirected edges attached to the same k-molecule with the same direction, and the 2-step bi-directed edge is written as, Two edges e 2 αγ and e 2 γ α in equation (5) are equivalent, indicating it is same to apply edge merging operation on e 1 αβ and e 1 βγ , and to apply edge merging operation on e 1 γβ and e 1 βα . Figure 6 shows an example on edge merging operation.
Zero edge 0 is defined to indicate all non-existing bidirected edges. Note that 0 ⊗ e x αβ = 0, e x αβ ⊗ 0 = 0 . A z-step bi-directed edge can be obtained by, Definition 3: Multi-step Bi-directed Graph(MSG). A MSG derived from a read set S = {s 1 , s 2 , . . . , s h }, is written as, where g is the length of reference sequence w, Given  edge, and the corresponding k-molecule aˆor bˆas semiextended k-molecule. If e x αβ cannot be extended by any edge, this edge is called as full-extended edge, and k-molecule α and β are full-extended k-molecules. In Figure 5 and 6, semi-extended k-molecule and full-extended k-molecule are plotted with different colors (yellow for semi-extended k-molecules and blue for full-extended k-molecules), semi-extended edge and full-extended edge are drawn with different lines (broken line for semi-extended edges and real line for full-extended edges). Property 1. If the set of full-extended edges in the MSG defined in equation 7 is denoted as E * S , then the set of labels on all edges in E * S can be written as, and we have L * S = C , C is the set of contigs. The proof is presented in Appendix 1.
Property 2. Edge merging operation ⊗ over the multistep bi-directed edge set E S ∨ 0 is associative, and Q ( E S ∨ 0 ,⊗) is a semigroup. The proof is presented in Appendix 1.
The key property of 1-step bi-directed graph G 1 k (S) is that each read s corresponds to a path starting from the first k-molecule of s and ending at the last k-molecule. Similarly, each chromosome can also be regarded as a path. However because of sequencing gaps, read errors, and repeats in the set of reads, chromosome will be broken into pieces, or contigs. Within a MSG, each contig corresponds to one full-extended edge in G k (S), and this has been presented and proved in Property 1. Property 2 ensures that the edge merging operation ⊗ over the set of multi-step bi-directed edges has formed a semi-group, and this connects the standard genome assembly (SGA) problem with edge merging operations in semi-group. According to the associativity law of semi-group, the final full-extended edges or contigs will be the same as long as all edges have been merged regardless of the merging order, thus these edge merging operations can be computed in parallel. Finally in order to reconstruct the genome with a large set of contigs, we need to merge all semi-extended edges into full-extended edges in semi-group Q(E S ∨ 0 , ⊗).

SWAP computational framework
The lock-computing-unlock mechanism of SWAP was first introduced in our previous work [24], where no implementation details and complexity analysis is given. In this section, we present the mathematical description of the problems which can be solved by SWAP, then a scalable computational framework for SWAP model and its programming interface are presented. Its complexity and scalability is analyzed in Appendix 3.
Definition 4: small world of operations. Semi-group SG(A, R) is defined on set A with an associative operation R : A × A A. R(a i , a j ) is used to denote the associative operation on a i and a j , a i , a j ∈ A. The elements a i and a j , together with the operation R(a i , a j ) are grouped as a small world of the operation R(a i , a j ). We denote this small world as [a i , a j ], and [a i , a j ] = {R(a i , a j ), a i , a j }.
Activity ACT (A, σ) are given on a semi-group SG(A, R) as the computational works performed by a graph algorithm, where operation set s is a subset of R.
In real application, an operation corresponds to a basic operation of a given algorithm. For example, for MSG based genome assembly application, an operation can be defined as edge merging. For topological sorting, reordering a pair of vertices can be defined as an operation. For any two small worlds [a 1 , a 2 ], [b 1 , b 2 ], where a 1 ≠ b 1 , a 1 I= b 2 , a 2 ≠ b 1 , a 2 ≠ b 2 , the corresponding operations R(a 1 , a 2 ) and R(b 1 , b 2 ), can be computed independently, thus, there exists potential parallelism in computing activity induced from the semi-group SG(A, R). We use SWAP for such parallel computing. The basic schedule of SWAP is Lock-Computing-Unlock. In SWAP computational framework, all operations s in activity ACT (A, s) can be distributed among a group of processes. Each process needs to fetch related elements, for example a and b, to compute operation R(a, b). At the same time, this process also has to cooperate with other processes for sending or updating local variables. Each process should have two threads, one is SWAP thread, which performs computing tasks using the three-steps schedule of SWAP, and the other is service thread, which listens and replies to remote processes. In the implementation of our framework, we avoid multi-threads technology by using nonblocking communication and finite-state machine in each process to ensure its efficiency.
The activity ACT (A, s) on set A with operations in s can be treated as a graph G(s, A) with s as its vertices and A as its edges. Adjacent list is used to store the graph G(s, A), and a hash function hashF un(x) is used to distribute the set s into p subset for p processes, Figure 7. In Appendix 2, the pseudocodes of SWAP thread and service thread are demonstrated in Algorithm 1 and Algorithm 2, respectively.
Algorithm 1 describes the three-steps of SWAP computational framework, and Algorithm 2 on remote processers can cooperate with Algorithm 1 for running this schedule. The message functions, internal functions, and user-defined functions in Algorithm 1 and Algorithm 2 are listed in Table 2, where user-defined functions can be redefined for user-specific computational problems.
Similar to CSMA/CA in 802.11 protocol [27], Algorithm 1 adapts random back-off algorithm to avoid lock collision. A variety of backoff algorithms can be used, without loss of generality, binary exponential backoff [27] is used in SWAP thread. Note that all collided operations in s i share only one binary backoff, so the cost can be ignored as long as the number of relations in s i is huge.
The complexity and scalability analysis for SWAP computational framework are presented in Appendix 3. When the number of processes is less than the number of operations in s, which is true for most cases, equation (??) shows that SWAP computational framework can linearly scale up with the increasing number of cores. When the number of processes is larger than the number of operations, according to equation (??) the running time will be dominated by the communication round, which is bounded by log(d max ), where d max is the diameter of graph G(s, A).

Implementation of SWAP-Assembler
Based on MSG and SWAP computational framework, SWAP-Assembler consists with five steps, including input parallelization, graph construction, graph cleaning, graph reduction, and contig extension. Complexity analysis of SWAP-Assembler are presented in the end of this section.

Input parallelization
As the size of data generated by next generation sequencing technology generally has hundreds of Giga bytes, loading these data with one process costs hours to finish [14,16]. Similar to Ray [15] and YAGA [18], we use input parallelization to speedup the loading process. Given input reads with n nucleotides from a genome of size g, we divide the input file equally into p virtual data block, p is the number of processes. Each process reads the data located in its virtual data block only once. The computational complexity of this step is bounded by O( n p ) . For E.coli dataset of 4.4G bytes, SWAP-Assembler loads the data into memory in 4 seconds with 64 cores while YAGA uses 516.5 seconds [18], and for Yanhuang dataset SWAP-Assembler loads the data in 10 minutes while Ray costs 2 hour and 42 minutes.

Graph construction
This step aims to construct a 1-step bi-directed graph G 1 k (S) = {V s , E 1 s } , where V s and E 1 s are k-molecule set and 1-step bi-directed edge set. In this step, input sequences are broken into overlapping k-molecules by sliding a window of length k along the input sequence. An improvement to our previous work [24] is that the time usage on graph construction is overlapped with the previous step. As CPU computation and network communication can be performed when only partial data are loaded from the first step, they can be overlapped and combined as a pipeline. Computation and communication time used in this step are hid behind the time used on disk I/O in previous step.

Graph cleaning
This step cleans the erroneous k-molecules, based on the assumption that the erroneous k-mers have lower frequency compared with the correct ones [19]. Assuming that the errors are random, we identify the k-molecules with low frequency as erroneous k-molecules, and delete them from the vertex set. SWAP-Assembler also removes all edges with low frequency in the 1-step bi-directed graph, and the k-molecules without any attached edges. The frequency threshold can be set by users, or our method will calculate it automatically based on the average coverage of k-molecules. In our case, we prefer 3~10% of the average coverage as the threshold depending on the species.
All the operations in this step can be finished in O( n p ) parallel computation time, and about 60~80% of the k-molecules can be removed from our graph.

Graph reduction
In order to recover contigs, all semi-extended edges in MSG need to be merged into full-extended edges. This task can be defined as edge merging computing activity and denoted as ACT (Es ∨ 0, σ ), where the edge merging operation set s is, in which e u βα indicates an u-step bi-directed edge connecting two vertices b and a. All semi-extended edges of E s will be merged into full-extended edges finally.
In order to compute edge merging operations in s i using our SWAP computational framework, two userdefined functions in Table 2  . The proposed methods has much less computation round of O(loglog(g)) than YAGA with O(log(n)2) [18,19]. The detailed complexity analysis is provided in Appendix 4.

Contig extension
In order to extend the length of contigs while maintaining as less errors as possible, three types of special edges and two type of special vertices are processed in our method.
The first type of special edge is tip edge, which is connected with an terminal vertex and has a length less than k, where k is the k-mer size. These tip edges are deleted from the graph. The second type is self-loop edge, whose beginning vertex and terminal vertex are same. If this vertex has another edge which can be merged with this self-loop edge, they will be merged, otherwise it will be removed. The last type is multiple edge, whose two vertices are directly connected by two different edges. In this case the edge with lower coverage will be removed.
In addition, processing two special vertices can help further improve the quality of contigs. The first is cross vertex shown in Figure 8-(e), which has more than two edges on both sides. For each cross vertex, we sort all its edges according to their coverage. When the coverage difference between the two edges is less than 20%, then these two edges are merged as long as they can be merged regardless of other edges. The second vertex is virtual cross vertex shown in Figure 8-(f). We treat edge e 0 with its two end vertices as one virtual vertex A* and A* has more than two edges on both sides. All its edges are ranked according to their coverage. When the coverage difference between two edges on different nodes is less than 20%, these two edges will be merged with the edge e 0 regardless of other edges. By processing these two special vertices using the heuristic method, we can partly resolve some of the repeats satisfying our strict conditions at the cost of introducing errors and mismatches into contigs occasionally.
The graph reduction step and contig extension step need to be iterated in a constant number of rounds to extend full-extended edges or stop when no special edges and special vertices can be found. The number of errors and mismatches introduced in contigs can be controlled by the percentage of special edges and vertices processed in contig extension stage. In our method, we process all edges and vertices aiming at obtaining longer contigs. The computing complexity for contig extension step will be Equation (10) indicates that, for a given genome with fixed length g, the speedup is proportional to the number of processors; while for a given number of processors p, the speedup is inversely proportional to logarithm of the logarithm of the genome size g. However when the number of processors is larger than the length of longest path d max , the running time will be bounded by the number of communication round, which is presented in equation (??) in Appendix 4. In either situation, we can conclude that the scalability or the optimal number of cores will increase with larger genomes.

Results
SWAP-Assembler is a highly scalable and efficient genome assembler using multi-step bi-directed graph (MSG). In this section, we perform several computational experiments to evaluate the scalability and assembly quality of SWAP-Assembler. In the experiments, TianHe 1A [28] is used as the high performance cluster. 512 computing nodes are allocated for the experiment with 12 cores and 24GB memory on each node. By comparing with several state-of-the-art sequential and parallel assemblers, such as Velvet [22], SOAPdenovo [29], Pasha [16], ABySS [14] and Ray [15], we evaluate the scalability, quality of contigs in terms of N50, error rate and coverage for SWAP-Assembler.

Scalability evaluation
The scalability of our method is first evaluated on a share memory machine with 32 cores and 1T memory. Five other assemblers including Velvet, SOAPdenovo, Pasha, ABySS and Ray, are included for comparison. Only the first three small datasets in Table 3 are used in this test due to the memory limitation. The results are presented in Table 4, and the corresponding figures are plotted in Figure 9. According to Table 4, SWAP-Assembler has the lowest running time on all three datasets for 16 cores and 32 cores, and SOAPdenovo has the lowest time usage on 4 cores and 8 cores. According to Figure 9, SOAPdenovo, Ray and SWAP-Assembler can scale to 32 cores, however Pasha and ABySS can only scale to 16 cores. Figure 9 also shows that Ray and SWAP-Assembler can achieve nearly linear speedup and SWAP-Assembler is more efficient than Ray.
The time usage for each step of SWAP-Assembler on the share memory machine is also presented in Table 5 and Figure 10. The input parallelization step is overlapped with graph construction, thus we treat these two steps as one in this experiment. According to Table 5, for all three datasets the most time-consuming step is graph reduction, and the fastest steps are graph cleaning and contig extension. Figure 10 shows that input parallelization & graph construction, graph cleaning and graph reduction can achieve nearly linear speedup when the number of cores increases from 4 to 32 cores, whereas the contig extension step does not benefit as much as other steps.
To further evaluate the scalability of SWAP-Assembler from 64 to 4096 cores on TianHe 1A, we have compared our method with two parallel assemblers, ABySS and Ray, and the results are included in Table 6. According to Table 6 SWAP-Assembler is 119 times and 73 times faster than ABySS and Ray for 1024 cores on the S. aureus dataset. On the same dataset, ABySS and Ray cannot gain any speedup beyond 64 and 128 cores, respectively. However SWAP-Assembler scales up to 512 cores. For the R. sphaeroides dataset, ABySS, Ray, and SWAP-Assembler can scale up to 128, 256, and 1024 cores, respectively.
For three larger datasets, Table 6 shows that scalability of SWAP-Assembler is also better than the other two methods. On Hg14 dataset, SWAP-Assembler is 280 times faster than ABySS, and 38 times faster than Ray when using 1024 cores. Similar to the results on R. sphaeroides dataset, three assemblers still hold their turning point of scalability at 128, 256 and 1024 cores, respectively. Fish and Yanhuang dataset cannot be assembled by ABySS and Ray in 12 hours, so their running times are not recorded in Table 6. For 1024 cores, SWAP-Assembler assembles Fish dataset in 16 minutes, while it takes 26 minutes with 2048 cores to assemble the Yanhuang dataset. The speedup curves of SWAP-Assembler on processing five datasets are shown in Figure 11. It shows that the speedup of assembling two small datasets have a turning point at 512 cores, and linear speedup to 1024 cores is achieved for other three larger datasets. SWAP-Assembler can still benefits from the increasing cores up to 2048 cores on processing Yanhuang dataset.  Figure 9 Time usage comparison on a share memory machine for three small datasets (Time unit: seconds in logarithmic scale). In this test, the length of k-mer for all assemblers is set to be 31 and the k-mers cutoff threshold is set to be 3. For the three datasets, the sequencing data filtered by ALLPATH-LG is their input data. The time usage is recorded until the contig is generated. The horizontal axis is marked with the name of assemblers and the number of cores used.
Memory footprint is a bottleneck for assembling large genomes, and parallel assemblers is a solution for large genome assembly by using more memory on the computational nodes. For our case, Fish and Yanhuang genome assembly needs 1.6T bytes and 1.8T bytes memory, respectively. As in Tianhe 1A each server has 24G memory, Fish genomes cannot be assembled on a cluster with 64 servers. The same reasoning applies to Yanhuang dataset.
SWAP-Assembler has better scalability compared with Ray and ABySS due to two important improvements. Firstly, computational interdependence of edge merging operations on one single unanimous path is resolved by MSG. Secondly, SWAP computational framework can trigger parallel computation of all operations without interference, and the communication latency is hidden by improving the computing throughput. Ray and ABySS cannot merge the k-mers in a single linear chain in parallel, and PASHA can only parallelize the k-mer merging work on different chains, which limits their degree of parallelism. Figure 10 Time usage details of SWAP-Assembler's five steps on processing three small datasets using a share memory machine with 32 cores (Time unit: seconds in logarithmic scale). In this test, the length of k-mer for SWAP-Assembler is set to be 31 and the k-mers cutoff threshold is set to be 3. For the three datasets, the sequencing data filtered by ALLPATH-LG is their input data. The horizontal axis is marked with the name of datasets and the number of cores used.

Assembly quality assessment
This part evaluate the assembly quality of SWAP-Assembler. To be compatible with the comparison results from GAGE, we follow the error correction method of GAGE. ALLPATH-LG [35] and Quake [36] are used to correct errors for S. aureus and R. sphaeriodes datasets. The corrected reads are used as the input to ABySS, Ray and SWAP-Assembler. In addition, two other sequential genome assemblers, Velvet and SOAPdenovo, are selected in this experiment for comparison, and a machine with 1TB memory is used. The k-mer size for all assemblers varies between 19 and 31, and best assembly results from the experiments of different k-mer sizes for each assembler are reported in Table 7 and Table 8. Table 7 presents the results of four metrics for evaluation: the number of contigs, N50, number of erroneous contigs and error-corrected N50. Error-corrected N50 is used to exclude the misleading assessment of larger N50 by correcting erroneously concatenated contigs. Each erroneous contig is broken at every misjoin and at every indel longer than 5 bases. From Table 7, SWAP-Assembler generates 3 and 7 error contigs on S. aureus and R. sphaeriodes datasets, respectively, which are the smallest compared with other assemblers. N50 size and errorcorrected N50 size for SWAP-Assembler are also longer than two other parallel assemblers, Ray and ABySS. SOAPdenovo has minimal number of contigs and largest N50 size for both datasets. Table 8 summarizes the statistics of contigs generated for these two datasets. Three metrics in [30] are used to evaluate the quality of the contigs. In this table, "assembly size" close to its genome size is better. Larger "chaff size" can be indicative of the assembler being unable to integrate short repeat structures into larger contigs or not properly correcting erroneous bases. Coverage can be measured by the percentage of reference bases aligned to any assembled contig, which is "100%-Unaligned ref bases" [30]. SWAP-Assembler and Ray both have the smallest chaff size of 0.13% on R. sphaeroides dataset, and they show very close coverage and assembly sizes. In terms of S. aureus dataset, Ray has a lower chaff size of 0.10% compared with SWAP-Assembler, however, SWAP-Assembler generates better assembly size of 99.3% and larger coverage of 99.2%.
We also analyzed the contig statistics for three larger datasets and the results are presented in Table 9. Because these datasets do not have a standard reference set and the original script provided by GAGE requires a reference set, we wrote a script to analyze the assembly results using the number of contigs, N50 size, max length of contigs and bases in the contigs for evaluation. The original data of three datasets are directly processed by five assemblers with a fixed k-mer size of 31.
According to Table 9, the N50 size of contigs generated by SWAP-Assembler is longest for all three datasets. For Fish and Yanhuang datasets, SWAP-Assembler also performs best in the number of contigs and max length of contigs. However for SWAP-Assembler on Hg14 dataset, whose reads are extracted from the human dataset by mapping the human chromosome 14, the number of contigs, max length of contigs and bases in contigs have a rank of second, third, and second, respectively. SWAP-Assembler has a best N50 size for all datasets. This is because it has efficient graph cleaning and contig extension steps, which can handle sequencing errors efficiently. Four other assemblers, without the help from external tools on error correction, are affected by the quality of input reads on larger datasets.
In conclusion, SWAP-Assembler is a highly scalable and efficient genome assembler. The evaluation shows that our assembler can scales up to 2048 cores, which is much better than other parallel assemblers, and the quality of contigs generated by SWAP-Assembler is the best in terms of error rate for several small datasets and N50 size for two larger data sets.

Conclusion
In this paper, SWAP-Assembler, a fast and efficient genome assembler scaling up to thousands of cores, is presented. In SWAP-Assembler, two fundamental improvements are crucial for its scalability. Firstly, MSG  is presented as a comprehensive mathematical abstraction for genome assembly. With MSG the computational interdependence is resolved. Secondly, SWAP computational framework triggers the parallel computation of all operations without interference. Two additional steps are included to improve the quality of contigs. One is graph cleaning, which adopts the traditional methods of removing k-molecules and edges with low frequency, and the other is contig extension, which resolves special edges and some cross nodes with a heuristic method. Results show that SWAP-Assembler can scale up to 2048 cores on Yanhuang dataset using only 26 minutes, which is the best compared to other parallel assemblers, such as ABySS and Ray. Conitg evaluation results confirm that SWAP-Assembler can generate good results on contigs N50 size and retain low error rate. When processing massive datasets without using external error correction tools, SWAP-Assembler is immune from low data quality and generated longest N50 contig size. For large genome and metagenome data of Tara bytes, for example the human gut microbial community sequencing data, highly scalable and efficient assemblers will be essential for data analysis. Our future work will extend our algorithm development for massive matagenomics dataset with additional modules.
The program can be downloaded from https://sourceforge.net/projects/swapassembler.