SWAP-Assembler: scalable and efficient genome assembly towards thousands of cores
- Jintao Meng^{1, 2, 3},
- Bingqiang Wang^{4},
- Yanjie Wei^{1}Email author,
- Shengzhong Feng^{1} and
- Pavan Balaji^{5}
https://doi.org/10.1186/1471-2105-15-S9-S2
© Meng et al.; licensee BioMed Central Ltd. 2014
Published: 10 September 2014
Abstract
Background
There is a widening gap between the throughput of massive parallel sequencing machines and the ability to analyze these sequencing data. Traditional assembly methods requiring long execution time and large amount of memory on a single workstation limit their use on these massive data.
Results
This paper presents a highly scalable assembler named as SWAP-Assembler for processing massive sequencing data using thousands of cores, where SWAP is an acronym for Small World Asynchronous Parallel model. In the paper, a mathematical description of multi-step bi-directed graph (MSG) is provided to resolve the computational interdependence on merging edges, and a highly scalable computational framework for SWAP is developed to automatically preform the parallel computation of all operations. Graph cleaning and contig extension are also included for generating contigs with high quality. Experimental results show that SWAP-Assembler scales up to 2048 cores on Yanhuang dataset using only 26 minutes, which is better than several other parallel assemblers, such as ABySS, Ray, and PASHA. Results also show that SWAP-Assembler can generate high quality contigs with good N50 size and low error rate, especially it generated the longest N50 contig sizes for Fish and Yanhuang datasets.
Conclusions
In this paper, we presented a highly scalable and efficient genome assembly software, SWAP-Assembler. Compared with several other assemblers, it showed very good performance in terms of scalability and contig quality. This software is available at: https://sourceforge.net/projects/swapassembler
Keywords
Background
To cope with massive sequence data generated by next-generation sequencing machines, a highly scalable and efficient parallel solution for fundamental bioinformatic applications is important [1, 2]. With the help of high performance computing, cloud computing [3, 4], and many-cores in GPU [5], successful scalable examples have been seen in many embarrassingly parallel applications: sequence alignment [6–8], SNP searching [9, 10], expression analysis [11], etc. However, for tightly coupled graph related problems, such as genome assembly, a scalable solution is a still a big challenge [12, 13].
State-of-the-art trials on parallel assemblers include ABySS [14], Ray [15], PASHA [16], and YAGA [17–19]. ABySS adopts the traditional de Bruijn graph data structure proposed by Pevzner et. al. [20] and follows the similar assembly strategy as EULER SR [21] and Velvet [22]. The parallelization is achieved by distributing k-mers to multi-servers to build a distributed de Bruijn graph, and error removal and graph reduction are implemented over MPI communication primitives. Ray extends k-mers (or seeds) into contigs with a heuristical greedy strategy by measuring the overlapping level of reads in both direction. Based on the observation that the time consuming part of genome assembly are generating and distributing k-mers, constructing and simplifying the distributed de Bruijn graph, PASHA concentrates its effort on parallelizing these two stages to improve its efficiency. However, PASHA allows only single process for each unanimous path, and this limits its degree of parallelism. In their experiments, ABySS and PASHA take about 87 hours and 21 hours to assembly the Yoruban male genome with a coverage of 42X.
To avoid merging k-mers on two different servers, which can result in too many small inter-process messages and the communication latency, YAGA constructs a distributed de Bruijn graph by maintaining edge tuples in a community of servers. Reducible edges belonging to one unanimous path are grouped into one server using a list rank algorithm [23], then these unanimous paths are reduced locally on separated servers. The complexity of YAGA is bounded by $O\left(\frac{n}{p}\right)$ computing time, $O\left(\frac{n}{p}\right)$ communication volume, and O(log^{2}(n)) communication rounds, where n is the number of nucleotides in all reads, and p denotes the number of processors. Due to the fact that the recursive list ranking algorithm used in YAGA has a memory usage of $O\left(\frac{n\text{log}n}{p}\right)$, this will use large amount of memory and cause low efficiency.
Our previous work [24] tries to avoid access collision of merging two neighbor edges. In this work, 1-step bi-directed graph and a computational model named as SWAP are proposed for edge merging operation. In its experiments, the prototype of edge merging algorithm using SWAP can scale to 640 cores on both Yeast and C.elegans dataset. However this exploratory work only focuses on the edge merging operation of genome assembly, some other important problems are not addressed, for example, contig extension, complexity analysis etc.
In this paper, we present a highly scalable and efficient genome assembler named as SWAP-Assembler, which can scale to thousands of cores on processing massive sequencing data such as Yanhuang (500G). SWAP-Assembler includes five fully parallelized steps: input parallelization, graph construction, graph cleaning, graph reduction and contig extension. Compared with our previous work, two fundamental improvements have been made for graph reduction. Firstly MSG is presented as a comprehensive mathematical abstraction for genome assembly. Using MSG and semi-group theory, the computational interdependence on merging edges is resolved. Secondly, we have developed a scalable computational framework for SWAP, this framework triggers the parallel computation of all operations with no interference. In this paper, complexity of this framework and SWAP-Assembler is also analyzed and proved in detail. In addition, two steps in SWAP-Assembler are used to improve the quality of contigs. One is graph cleaning, which adopts the traditional strategy of removing k-molecules and edges with low frequency, and the other one is contig extension, which resolves special edges and some cross nodes using a heuristic method. Experimental results show that SWAP-Assembler can scale up to 2048 cores on Yanhuang dataset using only 26 minutes, which is the fastest compared with other assemblers, such as ABySS, Ray and PASHA. Conitg evaluation results confirm that SWAP-Assembler generates good results on N50 size with lowest error rate for S. aureus and R. sphaeroides datasets. When processing larger datasets (Fish and Yanhuang) without using external error correction tools, SWAP-Assembler generates the longest N50 contig sizes of 1309 bp and 1368 bp for these two datasets.
Methods
In this section, our method for genome assembly towards thousands of cores is presented. We first abstract the genome assembly problem with MSG. Generating longer sequences (contigs) from shorter sequences corresponds to merging semi-extended edges to full-extended edges in MSG. In addition, computational interdependence of edge merging is resolved by introducing a semi-group over a closed edge set Es V 0 in MSG. The edge set Es V 0 is proved to be a semi-group with respect to edge merging operation. According to the associativity law of semi-group, the final results will be the same as long as all edges have been merged regardless of the merging order, thus these edge merging operations can be computed in parallel.
In order to maximally utilize the potential parallelism resolved by MSG, a scalable SWAP computational framework is developed. As one edge may be accessed by two merging operations in two different processes at the same time, a lock-computing-unlock mechanism introduced in [24] is adopted for avoiding the conflict. For the problems which can be abstracted with semi-group, the corresponding operations can be done in parallel, and SWAP computational framework can achieve linearly scale up for these problems.
Based on MSG and SWAP computational framework, SWAP-Assembler is developed with five steps, including input parallelization, graph construction, graph cleaning, graph reduction, and contig extension. In the following, we first present MSG and the SWAP computational framework, then details of SWAP-Assembler's five steps will be followed.
Mathematical formulation of genome assembly using MSG
Given a biological genome sample with reference sequence w ∈ ℕ^{ g } , where N = {A, T, C, G}, g = |w|, a large number of short sequences called reads, S = {s_{1}, s_{2}, ..., s_{ h }}, can be generated from the sequencing machines. Genome assembly is the process of reconstructing the reference genome sequence from these reads. Unfortunately, the genome assembly problem of finding the shortest string with all reads as its substring falls into a NP-hard problem [25].
Problem of Standard Genome Assembly (SGA)
Input: Given a set of reads without errors S = {s_{1}, s_{2}, ..., s_{ h }}
Output: A set of contigs C = {c_{1}, c_{2}, ..., c_{ w }}
Requirement: Each contig maps to an unanimous path in the De Bruijn graph constructed from the set of reads S.
Preliminaries
We first define some variables. Let s ∈ ℕ^{ l } be a DNA sequence of length l. Any substring derived from s with length k, is called a k-mer of s, and it is denoted by α = s[j]s[j + 1] . . . s[j + k − 1], 0 ≤ j < l − k + 1. The set of k-mers of a given string s can be written as Z(s, k), where k must be odd. The reverse complement of a k-mer α, denoted by $\alpha \prime $, is obtained by reversing α and complementing each base by the following bijection of M, M : {a → t, t→a, c→g, g → c}. Note that $\alpha ={\alpha}^{{\prime}^{\prime}}$.
List of notations.
Definition | Notation | Example |
---|---|---|
set of nucleotides | ℕ | ℕ = {A, T, C, G} |
reference sequence | W | w = "T AGT CGAGG" |
read set | S | S = { "T AGT CG", "AGT CGA", "T CGAGG" } |
k-mer | α or α^{ / } | α = "T AG", α^{ / } = "CT A" |
positive k-mer | _{ α }+ | α^{+} = "T AG" |
negative k-mer | _{ α } − | α^{ − } = "CT A" |
representative k-mer | _{ α }+ | α^{+} = "T AG" |
k-molecule | $\hat{\alpha}={\alpha}^{+}=\left\{{\alpha}^{+},{\alpha}^{-}\right\}$ | $\hat{\alpha}$ = {"T AG", "CT A"} |
set of k-mers | ℤ(s, k) | ℤ("T AGT CG", 3) ={"TAG","AGT","GTC","TCG"} |
set of k-molecules | S(s, k) | ℤ("T AGT C", 3) = {{"T AG", "CT A"}, {"AGT ", "ACT "}, {"GT C", "GAC"}, {"T CG", "CGA"}} |
1-step Bi-directed Graph
Equations (3) declares that any two overlapped k-molecules can be connected with one 1-step bi-directed edge when they are consecutive in sequence s or the complementary sequence s'. Here d_{ α } denotes the direction of k-mer α, if α = α^{+}, d_{ α } ='+', otherwise d_{ α } = '-'. ${c}_{\alpha \beta}^{1}$ is the content or label of the edge, and is initialized with β[k − 1], that is ${c}_{\alpha \beta}^{1}=\beta \left[k-1\right]$, and we have $su\phantom{\rule{2.77695pt}{0ex}}f\left({\alpha}^{\circ}{c}_{\alpha \beta}^{1},k\right)=\beta $.
- 1
${e}_{\alpha +\beta +}^{1}=\left({\alpha}^{+},{\beta}^{+},+,+,{c}_{\alpha +\beta +}^{1}\right),\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}{e}_{\beta -\alpha -}^{1}=\left({\beta}^{-},{\alpha}^{-},-,-,{c}_{\beta -\alpha -}^{1}\right)$
- 2
${e}_{\alpha +\beta -}^{1}=\left({\alpha}^{+},{\beta}^{-},+,-,{c}_{\alpha +\beta -}^{1}\right),\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}{e}_{\beta +\alpha -}^{1}=\left({\beta}^{+},{\alpha}^{-},+,=,{c}_{\beta +\alpha -}^{1}\right)$
- 3
${e}_{\alpha -\beta +}^{1}=\left({\alpha}^{-},{\beta}^{+},-,+,{c}_{\alpha -\beta -}^{1}\right),\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}{e}_{\beta -\alpha +}^{1}=\left({\beta}^{-},{\alpha}^{+},-,+,{c}_{\beta -\alpha +}^{1}\right)$
- 4
${e}_{\alpha -\beta -}^{1}=\left({\alpha}^{-},{\beta}^{-},-,-,{c}_{\alpha -\beta -}^{1}\right),\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}{e}_{\beta +\alpha +}^{1}=\left({\beta}^{+},\phantom{\rule{2.77695pt}{0ex}}{\alpha}^{+},+,+,{c}_{\beta +\alpha +}^{1}\right)$
Multi-step Bi-directed Graph and Its Properties
where g is the length of reference sequence w, ${E}_{si}^{j}+\left\{{e}_{\alpha \beta}^{j}\text{|}\forall \hat{\alpha},\hat{\beta}\in \mathbb{S}\left({s}_{i},k\right)\right\}$. A MSG is obtained through edge merging operations.
Given an x-step bi-directed edge ${e}_{\alpha \beta}^{x}=\left(\alpha ,\beta ,{d}_{\alpha},{d}_{\beta},{c}_{\alpha \beta}^{x}\right)$, if there exists edge ${e}_{\gamma \alpha}^{y}$ or ${e}_{\beta \gamma}^{z}$ satisfying ${e}_{\gamma \alpha}^{y}\otimes {e}_{\alpha \beta}^{x}\ne 0$ or ${e}_{\alpha \beta}^{x}\otimes {e}_{\beta \gamma}^{z}\ne 0$, then we call edge ${e}_{\alpha \beta}^{x}$ as a semi-extended edge, and the corresponding k-molecule α ˆ or β ˆ as semi-extended k -molecule. If ${e}_{\alpha \beta}^{x}$ cannot be extended by any edge, this edge is called as full-extended edge, and k-molecule $\hat{\alpha}$ and $\hat{\beta}$ are full-extended k -molecules. In Figure 5 and 6, semi-extended k-molecule and full-extended k-molecule are plotted with different colors (yellow for semi-extended k-molecules and blue for full-extended k-molecules), semi-extended edge and full-extended edge are drawn with different lines (broken line for semi-extended edges and real line for full-extended edges).
and we have ${L}_{S}^{*}=C$, C is the set of contigs. The proof is presented in Appendix 1.
Property 2. Edge merging operation ⊗ over the multi-step bi-directed edge set ${E}_{S}\vee 0$ is associative, and Q(${E}_{S}\vee 0$,⊗) is a semigroup. The proof is presented in Appendix 1.
The key property of 1-step bi-directed graph ${G}_{k}^{1}\left(S\right)$ is that each read s corresponds to a path starting from the first k-molecule of s and ending at the last k-molecule. Similarly, each chromosome can also be regarded as a path. However because of sequencing gaps, read errors, and repeats in the set of reads, chromosome will be broken into pieces, or contigs. Within a MSG, each contig corresponds to one full-extended edge in ${G}_{k}\left(S\right)$, and this has been presented and proved in Property 1. Property 2 ensures that the edge merging operation ⊗ over the set of multi-step bi-directed edges has formed a semi-group, and this connects the standard genome assembly (SGA) problem with edge merging operations in semi-group. According to the associativity law of semi-group, the final full-extended edges or contigs will be the same as long as all edges have been merged regardless of the merging order, thus these edge merging operations can be computed in parallel. Finally in order to reconstruct the genome with a large set of contigs, we need to merge all semi-extended edges into full-extended edges in semi-group Q(${E}_{S}\vee 0$, ⊗).
SWAP computational framework
The lock-computing-unlock mechanism of SWAP was first introduced in our previous work [24], where no implementation details and complexity analysis is given. In this section, we present the mathematical description of the problems which can be solved by SWAP, then a scalable computational framework for SWAP model and its programming interface are presented. Its complexity and scalability is analyzed in Appendix 3.
Definition 4: small world of operations. Semi-group SG(A, R) is defined on set A with an associative operation R : A × A → A. R(a_{ i }, a_{ j }) is used to denote the associative operation on a_{ i } and a_{ j }, a_{ i }, a_{ j } ∈ A. The elements a_{ i } and a_{ j }, together with the operation R(a_{ i }, a_{ j } ) are grouped as a small world of the operation R(a_{ i }, a_{ j }). We denote this small world as [a_{ i }, a_{ j }], and [a_{ i }, a_{ j }] = {R(a_{ i }, a_{ j }), a_{ i }, a_{ j }}.
Activity ACT (A, σ) are given on a semi-group SG(A, R) as the computational works performed by a graph algorithm, where operation set σ is a subset of R.
In real application, an operation corresponds to a basic operation of a given algorithm. For example, for MSG based genome assembly application, an operation can be defined as edge merging. For topological sorting, re-ordering a pair of vertices can be defined as an operation.
- 1
Lock action is applied to lock R(a, b)'s small world [a, b].
- 2
Computing is performed for operation R(a, b), and the values of a, b are updated accordingly. In MSG, this corresponds to merging two edges.
- 3
Unlock action is triggered to release operation R(a, b)!s small world [a, b].
In SWAP computational framework, all operations σ in activity ACT (A, σ) can be distributed among a group of processes. Each process needs to fetch related elements, for example a and b, to compute operation R(a, b). At the same time, this process also has to cooperate with other processes for sending or updating local variables. Each process should have two threads, one is SWAP thread, which performs computing tasks using the three-steps schedule of SWAP, and the other is service thread, which listens and replies to remote processes. In the implementation of our framework, we avoid multi-threads technology by using nonblocking communication and finite-state machine in each process to ensure its efficiency.
Description of message functions, internal functions, and user-defined functions used in Algorithm 1 and 2.
Class | Function name | Function description |
---|---|---|
Message functions | Msg Lock(a, p) | Lock a in process p |
Msg Unlock(a, p) | Unlock a in process p | |
Msg Read(a, p) | Fetch associated values of a | |
Msg Write(a, newa, p) | Update associated values of a with newa | |
Msg Locksuccess(a, R(a, b), p) | Send Locksuccess Message back to R(a, b) | |
Msg Lockfailed(a, R(a, b), p) | Send Lockfailed Message back to R(a, b) | |
Msg ReadBack(a, R(a, b), p) | Send associated value of a back to R(a, b) | |
Msg End() | Command to stop the service thread | |
Internal functions | proc(a) | Get process ID of a |
trylock(a) | Lock a | |
unlock(a) | Unlock a | |
User-defined functions | GetSmallWorld( R(a, b) ) | Get small world [a, b] from operation R(a, b) |
Operation(a,b) | Compute the operation R(a, b) |
Similar to CSMA/CA in 802.11 protocol [27], Algorithm 1 adapts random back-off algorithm to avoid lock collision. A variety of backoff algorithms can be used, without loss of generality, binary exponential backoff [27] is used in SWAP thread. Note that all collided operations in σ_{ i } share only one binary backoff, so the cost can be ignored as long as the number of relations in σ_{ i } is huge.
The complexity and scalability analysis for SWAP computational framework are presented in Appendix 3. When the number of processes is less than the number of operations in σ, which is true for most cases, equation (??) shows that SWAP computational framework can linearly scale up with the increasing number of cores. When the number of processes is larger than the number of operations, according to equation (??) the running time will be dominated by the communication round, which is bounded by log(d_{ max }), where d_{ max } is the diameter of graph G(σ, A).
Implementation of SWAP-Assembler
Based on MSG and SWAP computational framework, SWAP-Assembler consists with five steps, including input parallelization, graph construction, graph cleaning, graph reduction, and contig extension. Complexity analysis of SWAP-Assembler are presented in the end of this section.
Input parallelization
As the size of data generated by next generation sequencing technology generally has hundreds of Giga bytes, loading these data with one process costs hours to finish [14, 16]. Similar to Ray [15] and YAGA [18], we use input parallelization to speedup the loading process. Given input reads with n nucleotides from a genome of size g, we divide the input file equally into p virtual data block, p is the number of processes. Each process reads the data located in its virtual data block only once. The computational complexity of this step is bounded by $O\left(\frac{n}{p}\right)$. For E.coli dataset of 4.4G bytes, SWAP-Assembler loads the data into memory in 4 seconds with 64 cores while YAGA uses 516.5 seconds [18], and for Yanhuang dataset SWAP-Assembler loads the data in 10 minutes while Ray costs 2 hour and 42 minutes.
Graph construction
This step aims to construct a 1-step bi-directed graph ${G}_{k}^{1}\left(S\right)=\left\{{V}_{s},{E}_{s}^{1}\right\}$, where V_{ s } and ${E}_{s}^{1}$ are k-molecule set and 1-step bi-directed edge set. In this step, input sequences are broken into overlapping k-molecules by sliding a window of length k along the input sequence. A k-molecule can have up to eight edges, and each edge corresponds to a possible one-base extension, {A, C, G, T} in either direction. The adjacent k-molecule can be easily obtained by adding the base extension to the source k-molecule. The generated graph has O(n) k-molecules and O(n) bi-directed edges distributed among p processors. Graph construction of 1-step bi-directed graph can be achieved in $O\left(\frac{n}{p}\right)$ parallel computing time, and $O\left(\frac{n}{p}\right)$ parallel communication volume.
An improvement to our previous work [24] is that the time usage on graph construction is overlapped with the previous step. As CPU computation and network communication can be performed when only partial data are loaded from the first step, they can be overlapped and combined as a pipeline. Computation and communication time used in this step are hid behind the time used on disk I/O in previous step.
Graph cleaning
This step cleans the erroneous k-molecules, based on the assumption that the erroneous k-mers have lower frequency compared with the correct ones [19]. Assuming that the errors are random, we identify the k-molecules with low frequency as erroneous k-molecules, and delete them from the vertex set. SWAP-Assembler also removes all edges with low frequency in the 1-step bi-directed graph, and the k-molecules without any attached edges. The frequency threshold can be set by users, or our method will calculate it automatically based on the average coverage of k-molecules. In our case, we prefer 3 ~ 10% of the average coverage as the threshold depending on the species.
All the operations in this step can be finished in $O\left(\frac{n}{p}\right)$ parallel computation time, and about 60 ~ 80% of the k-molecules can be removed from our graph.
Graph reduction
in which ${e}_{\beta \alpha}^{u}$ indicates an u-step bi-directed edge connecting two vertices β and α. All semi-extended edges of E_{ s } will be merged into full-extended edges finally.
In order to compute edge merging operations in σ_{ i } using our SWAP computational framework, two user-defined functions in Table 2 are described as Algorithm 3 and Algorithm 4 in Appendix 2. For each process, the edge merging step has a computing complexity of $O\left(\frac{n}{p}\right)$, communication volume of $\left(\frac{n\phantom{\rule{2.77695pt}{0ex}}\text{log}\left(\text{log}\left(g\right)\right)}{p}\right)$, and communication round of O(log(log(g))). The proposed methods has much less computation round of O(loglog(g)) than YAGA with O(log(n)2) [18, 19]. The detailed complexity analysis is provided in Appendix 4.
Contig extension
In order to extend the length of contigs while maintaining as less errors as possible, three types of special edges and two type of special vertices are processed in our method.
The first type of special edge is tip edge, which is connected with an terminal vertex and has a length less than k, where k is the k-mer size. These tip edges are deleted from the graph. The second type is self-loop edge, whose beginning vertex and terminal vertex are same. If this vertex has another edge which can be merged with this self-loop edge, they will be merged, otherwise it will be removed. The last type is multiple edge, whose two vertices are directly connected by two different edges. In this case the edge with lower coverage will be removed.
Equation (10) indicates that, for a given genome with fixed length g, the speedup is proportional to the number of processors; while for a given number of processors p, the speedup is inversely proportional to logarithm of the logarithm of the genome size g. However when the number of processors is larger than the length of longest path d_{ max }, the running time will be bounded by the number of communication round, which is presented in equation (??) in Appendix 4. In either situation, we can conclude that the scalability or the optimal number of cores will increase with larger genomes.
Results
SWAP-Assembler is a highly scalable and efficient genome assembler using multi-step bi-directed graph (MSG). In this section, we perform several computational experiments to evaluate the scalability and assembly quality of SWAP-Assembler. In the experiments, TianHe 1A [28] is used as the high performance cluster. 512 computing nodes are allocated for the experiment with 12 cores and 24GB memory on each node. By comparing with several state-of-the-art sequential and parallel assemblers, such as Velvet [22], SOAPdenovo [29], Pasha [16], ABySS [14] and Ray [15], we evaluate the scalability, quality of contigs in terms of N50, error rate and coverage for SWAP-Assembler.
Experimental data
Details about the five short read datasets.
- | S. aureus | R. sphaeroides | Hg14 | Fish | YanHuang |
---|---|---|---|---|---|
Fastq data size (Gbytes) | 0.684 | 0.906 | 14.2 | 425 | 495 |
Read length (bp) | 37, 101 | 101 | 101 | 101 | 80-120 |
No. of reads (million) | 4.8 | 4.1 | 62 | 1910 | 1859 |
Coverage | 90X | 90X | 70X | 192X | 57X |
Reference size (Mbp) | 2.90 | 4.60 | 88.6 | 1000 | 3000 |
Scalability evaluation
Time usage results of three small datasets on a share memory machine with 32 cores.
Assembler | Cores | S. aureus | R. sphaeroides | Hg14 |
---|---|---|---|---|
Velvet | 1 | 159 | 265 | 5432 |
SOAPdenovo | 4 | 44 | 71 | 1004 |
8 | 44 | 69 | 933 | |
16 | 36 | 57 | 742 | |
32 | 38 | 45 | 584 | |
Pasha | 4 | 215 | 342 | 5494 |
8 | 159 | 255 | 3938 | |
16 | 147 | 255 | 3436 | |
32 | 183 | 289 | 4852 | |
ABySS | 4 | 174 | 302 | 4138 |
8 | 146 | 234 | 3079 | |
16 | 139 | 226 | 2588 | |
32 | 147 | 235 | 2596 | |
Ray | 4 | 1247 | 1778 | 24145 |
8 | 707 | 1050 | 13116 | |
16 | 454 | 688 | 7222 | |
32 | 351 | 467 | 4235 | |
SWAP-Assembler | 4 | 81 | 129 | 2167 |
8 | 42 | 69 | 1187 | |
16 | 24 | 38 | 685 | |
32 | 13 | 21 | 408 |
Time usage details of SWAP-Assembler's five steps on processing three small datasets using a share memory machine with 32 cores.
Steps | 4 cores | 8 cores | 16 cores | 32 cores | |
---|---|---|---|---|---|
S. aureus | input parallelization & graph construction | 31.29 | 16.32 | 9.55 | 4.91 |
graph cleaning | 2.14 | 1.07 | 0.6 | 0.3 | |
graph reduction | 45.75 | 24.2 | 13.23 | 7.67 | |
contig extension | 1.54 | 0.89 | 0.6 | 0.52 | |
R. sphaeroides | input parallelization & graph construction | 49.19 | 26.42 | 14.88 | 7.79 |
graph cleaning | 3.63 | 1.99 | 0.97 | 0.54 | |
graph reduction | 73.78 | 39.31 | 20.83 | 12.19 | |
contig extension | 2.37 | 1.33 | 0.85 | 0.64 | |
Hg14 | input parallelization & graph construction | 628 | 327 | 184 | 98 |
graph cleaning | 71 | 37 | 19 | 10 | |
graph reduction | 1328 | 734 | 413 | 242 | |
contig extension | 140 | 89 | 69 | 58 |
Scalability evaluation on parallel assemblers 1 (Time unit: seconds).
dataset | software | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 |
---|---|---|---|---|---|---|---|---|
S. aureus (2.87 Mb) | ABySS | 248 | 269 | 334 | 554 | 831 | _{-2} | - |
Ray | 244 | 198 | 202 | 266 | 510 | - | - | |
SWAP-Assembler | 23 | 15 | 8 | 5 | 7 | 13 | - | |
R. sphaeroides (4.60 Mb) | ABySS | 492 | 454 | 522 | 718 | 1312 | - | - |
Ray | 287 | 183 | 181 | 190 | 285 | - | - | |
SWAP-Assembler | 43 | 29 | 15 | 7 | 7 | 7 | - | |
Hg14 (88.29 Mb) | ABySS | 6472 | 5299 | 6935 | 9045 | 16530 | - | - |
Ray | 2926 | 1746 | 1288 | 1517 | 2266 | - | - | |
SWAP-Assembler | 585 | 428 | 203 | 128 | 59 | 67 | - | |
Fish (1 Gb | ABySS | _{*} 3 | _{4} | + | + | + | - | - |
Ray | * | + | + | + | + | - | - | |
SWAP-Assembler | + | 13941 | 8622 | 3263 | 962 | 971 | 2582 | |
Yanhuang (3 Gb) | ABySS | * | + | + | + | + | - | - |
Ray | * | + | + | + | + | - | - | |
SWAP-Assembler | * | 11243 | 5761 | 4021 | 1783 | 1608 | 1778 |
Memory footprint is a bottleneck for assembling large genomes, and parallel assemblers is a solution for large genome assembly by using more memory on the computational nodes. For our case, Fish and Yanhuang genome assembly needs 1.6T bytes and 1.8T bytes memory, respectively. As in Tianhe 1A each server has 24G memory, Fish genomes cannot be assembled on a cluster with 64 servers. The same reasoning applies to Yanhuang dataset.
SWAP-Assembler has better scalability compared with Ray and ABySS due to two important improvements. Firstly, computational interdependence of edge merging operations on one single unanimous path is resolved by MSG. Secondly, SWAP computational framework can trigger parallel computation of all operations without interference, and the communication latency is hidden by improving the computing throughput. Ray and ABySS cannot merge the k-mers in a single linear chain in parallel, and PASHA can only parallelize the k-mer merging work on different chains, which limits their degree of parallelism.
Assembly quality assessment
Assembly results of S.aureus and R. sphaeriodes datasets.
Software | S. aureus | Contigs | R. sphaeroides Contigs | |||||
---|---|---|---|---|---|---|---|---|
Num | N50(kb) | Errors | N50 corr. (kb) | Num | N50(kb) | Errors | N50 corr.(kb) | |
Velvet | 162 | 48.4 | 28 | 41.5 | 583 | 15.7 | 35 | 14.5 |
SOAPdenovo | 107 | 288.2 | 48 | 62.7 | 204 | 131.7 | 414 | 14.3 |
ABySS | 302 | 29.2 | 14 | 24.8 | 1915 | 5.9 | 55 | 4.2 |
Ray | 221 | 36.6 | 15 | 35.6 | 752 | 11.5 | 17 | 11.2 |
SWAP-Assembler | 183 | 51.1 | 3 | 37.8 | 529 | 16.2 | 7 | 12.3 |
Contig statistics on the assembly results of S.aureus, R. sphaeroides datasets. (Unit: Percentage %).
Dataset | Software | Assembly | Chaff | Unalign | Unalign | Duplicate | Compress |
---|---|---|---|---|---|---|---|
Size | Size | Ref bases | ASM bases | Ref bases | Ref bases | ||
S. aureus | Velvet | 99.2 | 0.45 | 0.79 | 0.03 | 0.1 | 1.28 |
(2.9 Mb) | SOAPdenovo | 101.3 | 0.35 | 0.38 | 0.01 | 1.44 | 1.41 |
ABySS | 127 | 66 | 1.37 | < 0.01 | 23.3 | 0.99 | |
Ray | 98.4 | 0.1 | 0.88 | 0.04 | 0.08 | 1.26 | |
SWAP-Assembler | 99.3 | 0.28 | 0.8 | 0.02 | 1.29 | 1.45 | |
Velvet | 97.8 | 0.54 | 1.6 | 0.01 | 0.29 | 0. 92 | |
SOAPdenovo | 99.9 | 0.45 | 0.88 | 0.02 | 1.07 | 0.51 | |
R. sphaeroides (4.6 Mb) | ABySS | 108 | 1.65 | 3.01 | 0.15 | 10.04 | 0.04 |
Ray | 99 | 0.13 | 1.03 | < 0.01 | 0.27 | 0.73 | |
SWAP-Assembler | 99.1 | 0.13 | 1.08 | 0.11 | 0.83 | 0.75 |
Table 7 presents the results of four metrics for evaluation: the number of contigs, N50, number of erroneous contigs and error-corrected N50. Error-corrected N50 is used to exclude the misleading assessment of larger N50 by correcting erroneously concatenated contigs. Each erroneous contig is broken at every misjoin and at every indel longer than 5 bases. From Table 7, SWAP-Assembler generates 3 and 7 error contigs on S. aureus and R. sphaeriodes datasets, respectively, which are the smallest compared with other assemblers. N50 size and error-corrected N50 size for SWAP-Assembler are also longer than two other parallel assemblers, Ray and ABySS. SOAPdenovo has minimal number of contigs and largest N50 size for both datasets.
Table 8 summarizes the statistics of contigs generated for these two datasets. Three metrics in [30] are used to evaluate the quality of the contigs. In this table, "assembly size" close to its genome size is better. Larger "chaff size" can be indicative of the assembler being unable to integrate short repeat structures into larger contigs or not properly correcting erroneous bases. Coverage can be measured by the percentage of reference bases aligned to any assembled contig, which is "100%-Unaligned ref bases" [30]. SWAP-Assembler and Ray both have the smallest chaff size of 0.13% on R. sphaeroides dataset, and they show very close coverage and assembly sizes. In terms of S. aureus dataset, Ray has a lower chaff size of 0.10% compared with SWAP-Assembler, however, SWAP-Assembler generates better assembly size of 99.3% and larger coverage of 99.2%.
Contig statistics of Hg14, Fish and Yanhuang datasets.
Dataset | Software | Contigs | |||
---|---|---|---|---|---|
Num | N50 (bp) | Max (bp) | BasesInContig (Mbp) | ||
Velvet | 90,784 | 1688 | 25,729 | 83.3 | |
Hg 14 (88.3 Mb) | SOAPdenovo | 200,153 | 836 | 21,144 | 96.4 |
ABySS | 190,693 | 1914 | 26,697 | 107.4 | |
Ray | 76,950 | 964 | 14,399 | 68.4 | |
SWAP-Assembler | 88,609 | 2036 | 21,246 | 96.4 | |
Fish (1 Gb) | Velvet | - | - | - | - |
SOAPdenovo | 3291290 | 378 | 7181 | 1,134.40 | |
ABySS | - | - | - | - | |
Ray | - | - | - | - | |
SWAP-Assembler | 2881443 | 1309 | 35,962 | 1,097.90 | |
Velvet | - | - | - | - | |
Yanhuang (3 Gb) | SOAPdenovo | 8,584,515 | 841 | 23,782 | 3396.2 |
ABySS | 9,218,967 | 1059 | 24,428 | 3691.8 | |
Ray | 3,755,103 | 266 | 6,765 | 1620.1 | |
SWAP-Assembler | 2,379,151 | 1368 | 31,152 | 2434.3 |
In conclusion, SWAP-Assembler is a highly scalable and efficient genome assembler. The evaluation shows that our assembler can scales up to 2048 cores, which is much better than other parallel assemblers, and the quality of contigs generated by SWAP-Assembler is the best in terms of error rate for several small datasets and N50 size for two larger data sets.
Conclusion
In this paper, SWAP-Assembler, a fast and efficient genome assembler scaling up to thousands of cores, is presented. In SWAP-Assembler, two fundamental improvements are crucial for its scalability. Firstly, MSG is presented as a comprehensive mathematical abstraction for genome assembly. With MSG the computational interdependence is resolved. Secondly, SWAP computational framework triggers the parallel computation of all operations without interference. Two additional steps are included to improve the quality of contigs. One is graph cleaning, which adopts the traditional methods of removing k-molecules and edges with low frequency, and the other is contig extension, which resolves special edges and some cross nodes with a heuristic method. Results show that SWAP-Assembler can scale up to 2048 cores on Yanhuang dataset using only 26 minutes, which is the best compared to other parallel assemblers, such as ABySS and Ray. Conitg evaluation results confirm that SWAP-Assembler can generate good results on contigs N50 size and retain low error rate. When processing massive datasets without using external error correction tools, SWAP-Assembler is immune from low data quality and generated longest N50 contig size.
For large genome and metagenome data of Tara bytes, for example the human gut microbial community sequencing data, highly scalable and efficient assemblers will be essential for data analysis. Our future work will extend our algorithm development for massive matagenomics dataset with additional modules.
The program can be downloaded from https://sourceforge.net/projects/swapassembler.
Additional file 1
Appendix. Appendix 1 property proof of MSG, Appendix 2 algorithms for SWAP-Assembler, Appendix 3 complexity analysis of SWAP computational framework, Appendix 4 complexity analysis of graph reduction.
Declarations
Acknowledgements
The authors would like to thank for the suggestions and modifications from Francis YL Chin, SM Yiu, C.M. Leung, Yu Peng and Yi Wang from the University of Hongkong, and anonymous reviewers invited by Recomb-seq 2014. The calculations of this work were performed on TianHe-1A at National Supercomputer Center in Tianjin.
Declarations
Funding for the publication of this article was provided by National Science Foundation of China under grant No. 11204342, the Science Technology and Innovation Committee of Shenzhen Municipality under grant No. JCYJ20120615140912201, and Shenzhen Peacock Plan under grant No. KQCX20130628112914299.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 9, 2014: Proceedings of the Fourth Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-Seq 2014). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S9.
Authors’ Affiliations
References
- Schatz MC, Langmead B, Salzberg SL: Cloud computing and the dna data race. Nature biotechnology. 2010, 28 (7): 691-10.1038/nbt0710-691.PubMed CentralView ArticlePubMedGoogle Scholar
- Stein LD: The case for cloud computing in genome informatics. Genome Biol. 2010, 11 (5): 207-10.1186/gb-2010-11-5-207.PubMed CentralView ArticlePubMedGoogle Scholar
- Armbrust M, Fox A, Griffith R, Joseph AD, Katz R, Konwinski A, Lee G, Patterson D, Rabkin A, Stoica I: A view of cloud computing. Communications of the ACM. 2010, 53 (4): 50-58. 10.1145/1721654.1721672.View ArticleGoogle Scholar
- Dean J, Ghemawat S: Mapreduce: simplified data processing on large clusters. Communications of the ACM. 2008, 51 (1): 107-113. 10.1145/1327452.1327492.View ArticleGoogle Scholar
- Owens JD, Houston M, Luebke D, Green S, Stone JE, Phillips JC: Gpu computing. Proceedings of the IEEE. 2008, 96 (5): 879-899. 10.1109/JPROC.2008.917757.View ArticleGoogle Scholar
- Matsunaga A, Tsugawa M, Fortes J: Cloudblast: Combining mapreduce and virtualization on distributed resources for bioinformatics applications. eScience, 2008 eScience'08 IEEE Fourth International Conference On. 2008, IEEE, 222-229. 10.1109/eScience.2008.62.View ArticleGoogle Scholar
- Wall DP, Kudtarkar P, Fusaro VA, Pivovarov R, Patil P, Tonellato PJ: Cloud computing for comparative genomics. BMC bioinformatics. 2010, 11 (1): 259-10.1186/1471-2105-11-259.PubMed CentralView ArticlePubMedGoogle Scholar
- Liu CM, Wong T, Wu E, Luo R, Yiu SM, Li Y, Wang B, Yu C, Chu X, Zhao K: Soap3: ultra-fast gpu-based parallel alignment tool for short reads. Bioinformatics. 2012, 28 (6): 878-879. 10.1093/bioinformatics/bts061.View ArticlePubMedGoogle Scholar
- Schatz MC: Cloudburst: highly sensitive read mapping with mapreduce. Bioinformatics. 2009, 25 (11): 1363-1369. 10.1093/bioinformatics/btp236.PubMed CentralView ArticlePubMedGoogle Scholar
- Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL: Searching for snps with cloud computing. Genome Biol. 2009, 10 (11): 134-10.1186/gb-2009-10-11-r134.View ArticleGoogle Scholar
- Langmead B, Hansen KD, Leek JT: Cloud-scale rna-sequencing differential expression analysis with myrna. Genome Biol. 2010, 11 (8): 83-10.1186/gb-2010-11-8-r83.View ArticleGoogle Scholar
- McPherson JD: Next-generation gap. Nature Methods. 2009, 6: 2-5.View ArticleGoogle Scholar
- Shendure J, Ji H: Next-generation dna sequencing. Nature biotechnology. 2008, 26 (10): 1135-1145. 10.1038/nbt1486.View ArticlePubMedGoogle Scholar
- Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol : Abyss: a parallel assembler for short read sequence data. Genome research. 2009, 19 (6): 1117-1123. 10.1101/gr.089532.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Boisvert S, Laviolette F, Corbeil J: Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. Journal of Computational Biology. 2010, 17 (11): 1519-1533. 10.1089/cmb.2009.0238.PubMed CentralView ArticlePubMedGoogle Scholar
- Liu Y, Schmidt B, Maskell DL: Parallelized short read assembly of large genomes using de bruijn graphs. BMC bioinformatics. 2011, 12 (1): 354-10.1186/1471-2105-12-354.PubMed CentralView ArticlePubMedGoogle Scholar
- Jackson BG, Aluru S: Parallel construction of bidirected string graphs for genome assembly. Parallel Processing. 2008, ICPP'08. 37th International Conference On, pp. 346-353 (2008). IEEEGoogle Scholar
- Jackson BG, Schnable PS, Aluru S: Parallel short sequence assembly of transcriptomes. BMC bioinformatics. 2009, 10 (Suppl 1): 14-10.1186/1471-2105-10-S1-S14.View ArticleGoogle Scholar
- Jackson BG, Regennitter M, Yang X, Schnable PS, Aluru S: Parallel de novo assembly of large genomes from high-throughput short reads. Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium On. 2010, 1-10. IEEEView ArticleGoogle Scholar
- Pevzner PA, Tang H, Waterman MS: An eulerian path approach to dna fragment assembly. Proceedings of the National Academy of Sciences. 2001, 98 (17): 9748-9753. 10.1073/pnas.171285098.View ArticleGoogle Scholar
- Chaisson MJ, Pevzner PA: Short read fragment assembly of bacterial genomes. Genome research. 2008, 18 (2): 324-330. 10.1101/gr.7088808.PubMed CentralView ArticlePubMedGoogle Scholar
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de bruijn graphs. Genome research. 2008, 18 (5): 821-829. 10.1101/gr.074492.107.PubMed CentralView ArticlePubMedGoogle Scholar
- Dehne F, Song SW: Randomized parallel list ranking for distributed memory multiprocessors. International journal of parallel programming. 1997, 25 (1): 1-16. 10.1007/BF02700044.View ArticleGoogle Scholar
- Meng J, Yuan J, Cheng J, Wei Y, Feng S: Small world asynchronous parallel model for genome assembly. Network and Parallel Computing. 2012, Springer, Gwangju, 145-155.View ArticleGoogle Scholar
- Pop M, Salzberg SL, Shumway M: Genome sequence assembly: Algorithms and issues. Computer. 2002, 35 (7): 47-54. 10.1109/MC.2002.1016901.View ArticleGoogle Scholar
- Kapun E, Tsarev F: De bruijn superwalk with multiplicities problem is np-hard. BMC bioinformatics. 2013, 14 (Suppl 5): 7-Google Scholar
- Andrew S: Computer Networks. 2003, Prentice Hall, The NetherlandsGoogle Scholar
- Yang XJ, Liao XK, Lu K, Hu QF, Song JQ, Su JS: The tianhe-1a supercomputer: its hardware and software. Journal of Computer Science and Technology. 2011, 26 (3): 344-351. 10.1007/s02011-011-1137-8.View ArticleGoogle Scholar
- Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K: De novo assembly of human genomes with massively parallel short read sequencing. Genome research. 2010, 20 (2): 265-272. 10.1101/gr.097261.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M: Gage: A critical evaluation of genome assemblies and assembly algorithms. Genome Research. 2012, 22 (3): 557-567. 10.1101/gr.131383.111.PubMed CentralView ArticlePubMedGoogle Scholar
- Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I, Boisvert10 S, Chapman JA, Chapuis G, Chikhi R: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. arXiv preprint arXiv:1301.5406. 2013Google Scholar
- Assemblathon 2. [http://assemblathon.org/assemblathon2]
- Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y: Soapdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012, 1 (1): 18-10.1186/2047-217X-1-18.PubMed CentralView ArticlePubMedGoogle Scholar
- Huang Yan: The First Asian Diploid Genome. [http://yh.genomics.org.cn]
- Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences. 2011, 108 (4): 1513-1518. 10.1073/pnas.1017351108.View ArticleGoogle Scholar
- Kelley DR, Schatz MC, Salzberg SL: Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 2010, 11 (11): 116-10.1186/gb-2010-11-11-r116.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Comments
View archived comments (1)